REVIVING AUTOENCODER PRETRAINING

Abstract

The pressing need for pretraining algorithms has been diminished by numerous advances in terms of regularization, architectures, and optimizers. Despite this trend, we re-visit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. We establish links between SVD and pretraining and show how it can be leveraged for gaining insights about the learned structures. Most importantly, we demonstrate that our approach yields an improved performance for a wide variety of relevant learning and transfer tasks ranging from fully connected networks over ResNets to GANs. Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today's deep learning environment.

1. INTRODUCTION

While approaches such as greedy layer-wise autoencoder pretraining (Bengio et al., 2007; Vincent et al., 2010; Erhan et al., 2010) arguably paved the way for many fundamental concepts of today's methodologies in deep learning, the pressing need for pretraining neural networks has been diminished in recent years. This was primarily caused by numerous advances in terms of regularization (Srivastava et al., 2014; Hanson & Pratt, 1989; Weigend et al., 1991) , network architectures (Ronneberger et al., 2015; He et al., 2016; Vaswani et al., 2017) , and improved optimization algorithms (Kingma & Ba, 2014; Loshchilov & Hutter, 2017; Reddi et al., 2019) . Despite these advances, training deep neural networks that generalize well to a wide range of previously unseen tasks remains a fundamental challenge (Neyshabur et al., 2017; Kawaguchi et al., 2017; Frankle & Carbin, 2018) . Inspired by techniques for orthogonalization (Ozay & Okatani, 2016; Jia et al., 2017; Bansal et al., 2018) , we re-visit the classic idea of unsupervised autoencoder pretraining in the context of reversible network architectures. Hence, we propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. A key insight is that there is no need for "greediness", i.e., layer-wise decompositions of the network structure, and it is additionally beneficial to take into account a specific problem domain at the time of pretraining. We establish links between singular value decomposition (SVD) and pretraining, and show how our approach yields an embedding of problem-aware dominant features in the weight matrices. An SVD can then be leveraged to conveniently gain insights about learned structures. Most importantly, we demonstrate that the proposed pretraining yields an improved performance for a variety of learning and transfer tasks. Our formulation incurs only a very moderate computational cost, is very easy to integrate, and widely applicable. 
The structure of our networks is influenced by invertible network architectures that have received significant attention in recent years (Gomez et al., 2017; Jacobsen et al., 2018; Zhang et al., 2018a) . However, instead of aiming for a bijective mapping that reproduces inputs, we strive for learning a general representation by constraining the network to represent an as-reversible-as-possible process for all intermediate layer activations. Thus, even for cases where a classifier can, e.g., rely on color for inference of an object type, the model is encouraged to learn a representation that can recover the input. Hence, not only the color of the input should be retrieved, but also, e.g., its shape. In contrast to most structures for invertible networks, our approach does not impose architectural restrictions. We demonstrate the benefits of our pretraining for a variety of architectures, from fully connected layers to convolutional neural networks (CNNs), over networks with and without batch normalization, to GAN architectures. We discuss other existing approaches and relate them to the proposed method in the appendix. Below, we will first give an overview of our formulation and its connection to singular values, before evaluating our model in the context of transfer learning. For a regular, i.e., a non-transfer task, the goal usually is to train a network that gives optimal performance for one specific goal. During a regular training run, the network naturally exploits any observed correlations between input and output distribution. An inherent difficulty in this setting is that typically no knowledge about the specifics of the new data and task domains is available when training the source model. Hence, it is common practice to target broad and difficult tasks hoping that this will result in features that are applicable in new domains (Zamir et al., 2018; Gopalakrishnan et al., 2017; Ding et al., 2017) . 
Motivated by autoencoder pretraining, we instead leverage a pretraining approach that takes into account the data distribution of the inputs. We demonstrate the gains in accuracy for original and new tasks below for a wide range of applications, from image classification to data-driven weather forecasting.

2. METHOD

With state-of-the-art methods, there is no need for breaking down the training process into single layers. Hence, we consider approaches that target whole networks, and especially orthogonalization regularizers, as a starting point (Huang et al., 2018). Orthogonality constraints were shown to yield improved training performance in various settings (Bansal et al., 2018), and can be formulated as

$$L_{\mathrm{ort}} = \sum_{m=1}^{n} \| M_m^T M_m - I \|_F^2, \qquad (1)$$

i.e., enforcing the transpose of the weight matrix $M_m \in \mathbb{R}^{s_m^{out} \times s_m^{in}}$ for all layers $m$ to yield its inverse when being multiplied with the original matrix. $I$ denotes the identity matrix with $I = (e_m^1, \dots, e_m^{s_m^{in}})$, where $e_m^j$ denotes the $j$-th column unit vector. Minimizing equation 1, i.e. $M_m^T M_m - I = 0$, is mathematically equivalent to

$$M_m^T M_m e_m^j - e_m^j = 0, \quad j = 1, 2, \dots, s_m^{in}, \qquad (2)$$

with $\mathrm{rank}(M_m^T M_m) = s_m^{in}$, and the $e_m^j$ as eigenvectors of $M_m^T M_m$ with eigenvalues of 1. This formulation highlights that equation 2 does not depend on the training data, and instead only targets the content of $M_m$. Inspired by classical unsupervised pretraining, we re-formulate the orthogonality constraint in a data-driven manner to take into account the set of inputs $D_m$ for the current layer (either activations from a previous layer or the training data $D_1$), and instead minimize

$$L_{RR} = \sum_{m=1}^{n} \sum_i \| M_m^T M_m d_m^i - d_m^i \|_2^2 = \sum_{m=1}^{n} \sum_i \| (M_m^T M_m - I)\, d_m^i \|_2^2. \qquad (3)$$

Expressing the input data in an orthonormal basis with coefficients $c_m$, and denoting by $V_m$ the right singular vectors of $M_m$, equation 3 drives $V_m$ towards the directions of $D_m$ with the largest $c_m$ coefficients. As $V_m$ is typically severely limited in terms of its representational capabilities by the number of adjustable weights in a network, it needs to focus on the most important eigenvectors in terms of $c_m$ in order to establish a small distance to the data $D_m$. Thus, features that appear multiple times in the input data, with a correspondingly larger factor in $c_m$, will more strongly contribute to minimizing $L_{RR}$.
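The contrast between the data-independent orthogonality regularizer and the data-driven variant can be sketched in a few lines of NumPy (our own illustrative code with our own naming, not the paper's implementation):

```python
import numpy as np

def ort_loss(M):
    """Orthogonality regularizer ||M^T M - I||_F^2: independent of any data."""
    return np.sum((M.T @ M - np.eye(M.shape[1])) ** 2)

def rr_loss(M, D):
    """Data-driven variant: sum_i ||(M^T M - I) d_i||^2 over input rows d_i."""
    return np.sum((D @ (M.T @ M - np.eye(M.shape[1])).T) ** 2)

rng = np.random.default_rng(0)
M = np.linalg.qr(rng.normal(size=(8, 4)))[0]  # orthonormal columns: M^T M = I
D = rng.normal(size=(16, 4))                  # 16 input samples

assert ort_loss(M) < 1e-20            # orthogonal weights satisfy both losses...
assert rr_loss(M, D) < 1e-20          # ...for any input data
assert ort_loss(np.ones((8, 4))) > 1.0  # non-orthogonal weights are penalized
```

Note how `rr_loss` weights the constraint by the actual inputs: directions that never occur in `D` contribute nothing, which is exactly what lets the network concentrate its limited capacity on dominant features.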
To summarize, $V_m$ is driven towards containing $r$ orthogonal vectors $w_m^i$ that represent the most frequent features of the input data, i.e., the dominant features. Additionally, as the column vectors of $V_m$ are mutually orthogonal, $M_m$ is encouraged to extract different features from the input. By virtue of being distinct and representative for the data set, these features have the potential to be useful for new inference tasks. The feature vectors embedded in $M_m$ can be extracted from the network weights in practical settings, as we will demonstrate below. Our goal with the reverse pass is to transpose all operations of the forward pass to obtain identical intermediate activations between the layers with matching dimensionality. We can then constrain the intermediate results of each layer of the forward pass to match the results of the backward pass, as illustrated in figure 2. Unlike greedy layer-wise autoencoder pretraining, which trains each layer separately and only constrains $d_1$ and $\tilde{d}_1$, we jointly train all layers and constrain all intermediate results. Due to the symmetric structure of the two passes, we can use a simple $L_2$ difference to drive the network towards aligning the results:

Realization in Neural Networks

$$L_{RR} = \sum_{m=1}^{n} \lambda_m \| d_m - \tilde{d}_m \|_F^2. \qquad (7)$$

Here $d_m$ denotes the input of layer $m$ in the forward pass, and $\tilde{d}_m$ the output of layer $m$ of the reverse pass. $\lambda_m$ denotes a scaling factor for the loss of layer $m$, which, however, is typically constant across all layers in our tests. Note that with our notation, $d_1$ and $\tilde{d}_1$ refer to the input data and the reconstructed input, respectively. Next, we show how this setup realizes the regularization from equation 3. For clarity, we use a fully connected layer with bias. In a neural network with $n$ hidden layers, the forward process for a layer $m$ is given by

$$d_{m+1} = M_m d_m + b_m,$$

while the corresponding reverse pass is

$$\tilde{d}_m = M_m^T (\tilde{d}_{m+1} - b_m), \qquad (8)$$

which yields $\| d_m - \tilde{d}_m \|_F^2 = \| M_m^T M_m d_m - d_m \|_F^2$. When this difference is minimized via equation 7, we obtain activated intermediate content during the reverse pass that reconstructs the values computed in the forward pass, i.e., $\tilde{d}_{m+1} = d_{m+1}$ holds. As in equation 8 the reverse pass activation $\tilde{d}_m$ depends on $\tilde{d}_{m+1}$, this formulation yields a full reverse pass from output to input, which we use for most training runs below. In this case,

$$\tilde{d}_m = M_m^T (\tilde{d}_{m+1} - b_m) = M_m^T (d_{m+1} - b_m) = M_m^T M_m d_m,$$

which is consistent with equation 3 and satisfies the original constraint $M_m^T M_m d_m - d_m = 0$. This version is preferable if a unique path from output to input exists. For architectures where the path is not unique, e.g., in the presence of additive residual connections, we use a local formulation

$$\tilde{d}_m = M_m^T (d_{m+1} - b_m), \qquad (10)$$

which employs the forward activation $d_{m+1}$ for jointly constraining all intermediate activations in the reverse pass. Up to now, the discussion focused on simplified neural networks without activation functions or extensions such as batch normalization (BN).
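For a single fully connected layer, the mechanics of the shared, transposed reverse pass can be sketched as follows (illustrative NumPy with our own naming, not the paper's code):

```python
import numpy as np

def forward(M, b, d):
    """One fully connected layer of the forward pass: d_{m+1} = M d_m + b."""
    return M @ d + b

def reverse(M, b, d_next):
    """Reverse pass reusing the transposed weights: M^T (d_{m+1} - b)."""
    return M.T @ (d_next - b)

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 4))
b = rng.normal(size=6)
d = rng.normal(size=4)

d_rec = reverse(M, b, forward(M, b, d))
layer_loss = np.sum((d_rec - d) ** 2)  # the per-layer term of L_RR

# reverse(forward(d)) collapses to M^T M d, so the L2 difference equals
# the data-driven constraint ||(M^T M - I) d||^2 from the derivation above:
assert np.allclose(d_rec, M.T @ M @ d)
assert np.allclose(layer_loss, np.sum(((M.T @ M - np.eye(4)) @ d) ** 2))
```

The bias cancels exactly, which is why only the weight matrix is constrained towards (data-weighted) orthogonality.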
While we leave a theoretical analysis of such extensions for future work, our experiments consistently show that the inherent properties of our pretraining remain valid: even with activations and BN, our approach successfully extracts dominant structures and yields improved generalization. In the appendix, we give details on how to ensure that the latent space content for forward and reverse pass is aligned such that differences can be minimized. To summarize, we realize the loss formulation of equation 7 to minimize $\sum_{m=1}^{n} \|(M_m^T M_m - I) d_m\|_2^2$ without explicitly having to construct $M_m^T M_m$. Following the notation above, we will refer to networks trained with the added reverse structure and the additional loss terms as RR variants. We consider two variants for the reverse pass: a local pretraining (equation 10) using the forward datum $d_{m+1}$ of a given layer, and a full version (equation 8) which uses $\tilde{d}_{m+1}$ incoming from the next layer during the reverse pass.

Embedding Singular Values

Below, Std denotes a regular training run (in orange in the graphs below), while RR denotes our models (in green). Pre and Ort denote regular autoencoder pretraining and orthogonality regularization, respectively, while a subscript denotes the task variant the model was trained for, e.g., Std_T for task T. While we typically use all layers of a network in the constraints, a reduced variant that we compare to below only applies the constraint to the input data, i.e., m = 1. A network trained with this variant, denoted by RR^1_A, is effectively trained to only reconstruct the input; it contains no constraints for the inner activations and layers of the network. For the Ort models, we use the Spectral Restricted Isometry Property algorithm (Bansal et al., 2018). We verify that the column vectors of $V_m$ of models from RR training contain the dominant features of the input with the help of a classification test, employing a single fully connected layer, i.e., $d_2 = M_1 d_1$, with batch normalization and activation. To quantify this similarity, we compute an LPIPS distance (Zhang et al., 2018b). The results above experimentally confirm our formulation of the RR loss and its ability to extract dominant and generalizing structures from the training data. Next, we will focus on quantified metrics and turn to measurements in terms of mutual information to illustrate the behavior of our pretraining for deeper networks.
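Extracting the embedded feature vectors from a trained weight matrix amounts to a plain SVD. A minimal sketch (our own toy example, not the paper's experiments, where the networks are of course far larger):

```python
import numpy as np

def dominant_features(M, k=3):
    """Top-k right singular vectors of a layer's weight matrix; after RR
    training these are expected to align with dominant input features."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return s[:k], Vt[:k]

# Toy check: a weight matrix built from a single known "feature" direction
# recovers exactly that direction as its top singular vector.
w = np.array([1.0, 0.0, 0.0, 0.0])
M = np.outer(np.arange(1.0, 6.0), w)  # rank-1: every row is a multiple of w
s, V = dominant_features(M, k=1)
assert np.allclose(np.abs(V[0]), w)   # top singular vector == embedded feature
```

For the MNIST example mentioned in the appendix, the rows of `Vt` reshaped to image dimensions are what makes the learned digit structures visible.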

3. EVALUATION IN TERMS OF MUTUAL INFORMATION

As our approach hinges on the introduction of the reverse pass, we will show that it succeeds in establishing mutual information (MI) between the input and the constrained intermediates inside a network. More specifically, the following tests employ networks with six fully connected layers with the objective to learn the mapping from 12 binary inputs to 2 binary output digits (Shwartz-Ziv & Tishby, 2017), with results accumulated over five runs. We compare the versions Std_A, Pre_A, Ort_A, RR_A, and a variant of the latter, RR^1_A, i.e., a version where only the input $d_1$ is constrained to be reconstructed. While figure 4(a) visually summarizes the content of the MI planes, the graph in (b) highlights that training with the RR loss correlates input and output distributions across all layers: the cluster of green points in the center of the graph shows that all layers contain balanced MI between input as well as output and the activations of each layer. RR^1_A fares slightly worse, while Std_A and Ort_A almost exclusively focus on the output, with $I(D_m; Y)$ being close to one. Pre_A instead only focuses on reconstructing inputs. Thus, the early layers cluster in the top-right corner, while the last layer, $I(D_7; Y)$, fails to align with the outputs. Once we continue fine-tuning these models without regularization, the MI naturally shifts towards the output, as shown in figure 4(c). Here, RR_AA outperforms the other models in terms of final performance. Likewise, RR_AB performs best for a transfer task B with switched output digits, as shown in graph (d). The final performance for both tasks across all runs is summarized in figure 5. These graphs visualize that the proposed pretraining succeeds in robustly establishing mutual information between inputs and targets across a full network, in addition to extracting reusable features.
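MI-plane measurements of this kind rest on estimating $I(X;Y)$ from samples. A minimal histogram-based estimator can look as follows (a common but coarse estimator; this is our own sketch, not necessarily the estimator used in the experiments above):

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in bits for two 1-D sample arrays."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)        # marginal of X
    py = pxy.sum(axis=0, keepdims=True)        # marginal of Y
    nz = pxy > 0                               # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
x = rng.normal(size=20000)
noise = rng.normal(size=20000)

# a variable shares far more information with itself than with independent noise
assert mutual_information(x, x) > 1.0
assert mutual_information(x, noise) < 0.1
```

Histogram estimators carry a positive bias that grows with the bin count, which is one reason MI is "challenging to assess and estimate", as noted below.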
MI has received attention recently as a learning objective, e.g., in the form of the InfoGAN approach (Chen et al., 2016) for learning disentangled and interpretable latent representations. While MI is typically challenging to assess and estimate (Walters-Williams & Li, 2009), the results above show that our approach provides a straightforward and robust way for including it as a learning objective. In this way, we can, e.g., reproduce the disentangling results from (Chen et al., 2016) , which are shown in figure 1(c ). A generative model with our pretraining extracts intuitive latent dimensions for the different digits, line thickness, and orientation without any additional modifications of the loss function. The joint training of the full network with the proposed reverse structure, including non-linearities and normalization, yields a natural and intuitive decomposition.

4. EXPERIMENTAL RESULTS

We now turn to a broad range of network structures, i.e., CNNs, autoencoders, and GANs, with a variety of data sets and tasks, to show that our approach succeeds in improving inference accuracy and generality for modern-day applications and architectures.

Transfer-learning Benchmarks

We first evaluate our approach with two state-of-the-art benchmarks for transfer learning. The first one uses the texture-shape data set from (Geirhos et al., 2018), which contains challenging images of various shapes combined with patterns and textures to be classified. The results below are given for 10 runs each. For the stylized data shown in figure 6(a), the accuracy of Pre_TS is low with 20.8%. This result is in line with observations in previous work and confirms the detrimental effect of classical pretraining. Std_TS yields a performance of 44.2%, and Ort_TS improves the performance to 47.0%, while RR_TS yields a performance of 54.7% (see figure 6(b)). Thus, the accuracy of RR_TS is 162.98% higher than Pre_TS, 23.76% higher than Std_TS, and 16.38% higher than Ort_TS. To assess generality, we also apply the models to new data. It is worth pointing out that the additional constraints of our training approach lead to moderately increased requirements for memory and computations, e.g., 41.86% more time per epoch than regular training for the texture-shape test. On the other hand, it allows us to train smaller models: we can reduce the weight count by 32% for the texture-shape case while still being on par with Ort_TS in terms of classification performance. By comparison, regular layer-wise pretraining requires a significant overhead and fundamental changes to the training process. Our pretraining fully integrates with existing training methodologies and can easily be deactivated via $\lambda_m = 0$.
As a second test case, we use a CIFAR-based task transfer (Recht et al., 2019) that measures how well models trained on the original CIFAR 10 generalize to a new data set (CIFAR 10.1) collected according to the same principles as the original one. Here we use a ResNet110 with 110 layers and 1.7 million parameters. Due to the consistently low performance of the Pre models (Alberti et al., 2017), we focus on Std, Ort, and RR for this test case. In terms of accuracy across 5 runs, Ort_C10 outperforms Std_C10 by 0.39%, while RR_C10 outperforms Ort_C10 by another 0.28% in terms of absolute test accuracy (figure 7). This increase for RR training matches the gains reported for orthogonality in previous work (Bansal et al., 2018), showing that our approach yields substantial practical improvements over the latter. It is especially interesting how well performance for CIFAR 10 translates into transfer performance for CIFAR 10.1. Here, RR_C10 still outperforms Ort_C10 and Std_C10 by 0.22% and 0.95%, respectively. Hence, the models from our pretraining very successfully translate gains in performance from the original task to the new one, indicating that the models have learned a more general set of features. To summarize, both benchmark cases confirm that the proposed pretraining benefits generalization.

Generative Adversarial Models

In this section, we employ our pretraining in the context of generative models for transferring from synthetic to real-world data from the ScalarFlow data set (Eckert et al., 2019). As super-resolution task A, we first use a fully convolutional generator network, adversarially trained with a discriminator network on the synthetic flow data.
[Figure: three prediction examples each of Z500, T850, and T2M, comparing Reference, Std, and RR.] For Pre_A, the layer-wise nature of the pretraining severely limits its capabilities to learn the correct data distribution (Zhou et al., 2014), leading to a low performance. We now mirror the generator model from the previous task to evaluate an autoencoder structure that we apply to two different data sets: the synthetic smoke data used for the GAN training (task B1), and a real-world RGB data set of smoke clouds (task B2). Thus, both variants represent transfer tasks, the second one being more difficult due to the changed data distribution. The resulting losses, summarized in figure 8(b), show that RR training performs best for both autoencoder tasks: the $L_2$ loss of RR_AB1 is 68.88% lower than Std_AB1, while it is 13.3% lower for task B2. The proposed pretraining also clearly outperforms the Pre variants. Within this series of tests, the RR performance for task B2 is especially encouraging, as this task represents a synthetic-to-real transfer.

Weather Forecasting

Pretraining is particularly attractive in situations where the amount of training data is severely limited. Weather forecasting is such a case, as systematic and accurate data for many relevant quantities are only available for approximately 50 years. We target three-day forecasts of pressure, ground temperature, and mid-level atmospheric temperature based on a public benchmark dataset (Rasp et al., 2020). This dataset contains worldwide observations from ERA5 (Hersbach et al., 2020) in six-hour intervals with a 5.625° resolution. For the joint inference of atmospheric pressure (500 hPa geopotential, Z500), ground temperature (T2M), and atmospheric temperature (at 850 hPa, T850), we use a convolutional ResNet architecture with 19 residual blocks.
As regular pretraining is not compatible with residual connections, we omit it here. We train a regular model (about 6.36M trainable parameters) with data from 1979 to 2015, and compare its inference accuracy across all data points from the years 2017 and 2018 to a similar model that employs our pretraining. While the regular model was trained for 25 epochs, the RR model was pretrained for 10 epochs and fine-tuned for another 15 epochs. Across all three physical quantities, the RR model clearly outperforms the regular model, as summarized in figure 1(d) and figure 9 (details are given in the appendix). Especially for the latitude-weighted RMSE of Z500, it yields an improvement of 5.5%. These improvements point to an improved generalization of the RR model via the pretraining and highlight its importance for domains where data is scarce.

5. CONCLUSIONS

We have proposed a novel pretraining approach inspired by classic methods for unsupervised autoencoder pretraining and orthogonality constraints. In contrast to the classical methods, we employ a constrained reverse pass for the full non-linear network structure and include the original learning objective. We have shown for a wide range of scenarios, from mutual information, over transfer learning benchmarks to weather forecasting, that the proposed pretraining yields networks with better generalizing capabilities. Our training approach is general, easy to integrate, and imposes no requirements regarding network structure or training methods. Most importantly, our results show that unsupervised pretraining has not lost its relevance in today's deep learning environment. As future work, we believe it will be exciting to evaluate our approach in additional contexts, e.g., for temporal predictions (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) , and for training explainable and interpretable models (Zeiler & Fergus, 2014; Chen et al., 2016; Du et al., 2018) .

A APPENDIX

To ensure reproducibility, source code and data for all tests will be published. Runtimes were measured on a machine with Nvidia GeForce GTX 1080 Ti GPUs and an Intel Core i7-6850K CPU.

A.1 DISCUSSION OF RELATED WORK

Greedy layer-wise pretraining was first proposed by (Bengio et al., 2007), and influenced a large number of follow-up works, providing a crucial method for enabling stable training runs of deeper networks. A detailed evaluation was performed by (Erhan et al., 2010), also highlighting cases where it can be detrimental. These problems were later also detailed in other works, e.g., by (Alberti et al., 2017). The transferability of learned features was likewise a topic of interest for transfer learning applications (Yosinski et al., 2014). Sharing similarities with our approach, (Rasmus et al., 2015) combined supervised and unsupervised learning objectives, but focused on denoising autoencoders and a layer-wise approach without weight sharing. We demonstrate the importance of leveraging state-of-the-art methods for training deep networks, i.e., without decomposing or modifying the network structure. This not only improves performance, but also very significantly simplifies the adoption of the pretraining pass in new application settings. Extending the classic viewpoint of unsupervised autoencoder pretraining, several prior methods employed "hard orthogonal constraints" to improve weight orthogonality via singular value decomposition (SVD) at training time (Huang et al., 2018; Jia et al., 2017; Ozay & Okatani, 2016). Bansal et al. (2018) additionally investigated efficient formulations of the orthogonality constraints. In practice, these constraints are difficult to satisfy, and correspondingly only weakly imposed. In addition, these methods focus on improving performance for a known, given task.
This means the training process only extracts features that the network considers useful for improving the performance of the current task, not necessarily improving generalization or transfer performance (Torrey & Shavlik, 2010) . While our approach shares similarities with SVD-based constraints, it can be realized with a very efficient L 2 -based formulation, and takes the full input distribution into account. Recovering all input information from hidden representations of a network is generally very difficult (Dinh et al., 2016; Mahendran & Vedaldi, 2016) , due to the loss of information throughout the layer transformations. In this context, (Tishby & Zaslavsky, 2015) proposed the information bottleneck principle, which states that for an optimal representation, information unrelated to the current task is omitted. This highlights the common specialization of conventional training approaches. Reversed network architectures were proposed in previous work (Ardizzone et al., 2018; Jacobsen et al., 2018; Gomez et al., 2017) , but mainly focus on how to make a network fully invertible via augmenting the network with special structures. As a consequence, the path from input to output is different from the reverse path that translates output to input. Besides, the augmented structures of these approaches can be challenging to apply to general network architectures. In contrast, our approach fully preserves an existing architecture for the backward path, and does not require any operations that were not part of the source network. As such, it can easily be applied in new settings, e.g., adversarial training (Goodfellow et al., 2014) . While methods using reverse connections were previously proposed (Zhang et al., 2018a; Teng & Choromanska, 2019) , these modules primarily focus on transferring information between layers for a given task, and on auto-encoder structures for domain adaptation, respectively.

A.2 PRETRAINING AND SINGULAR VALUE DECOMPOSITION

In this section we give a more detailed derivation of our loss formulation, extending the method section of the main paper. As explained there, our loss formulation aims for minimizing

$$L_{RR} = \sum_{m=1}^{n} \sum_i \| M_m^T M_m d_m^i - d_m^i \|_2^2, \qquad (11)$$

where $M_m \in \mathbb{R}^{s_m^{out} \times s_m^{in}}$ denotes the weight matrix of layer $m$, and data from the input data set $D_m$ is denoted by $d_m^i \in \mathbb{R}^{s_m^{in}}$, $i = 1, 2, \dots, t$. Here $t$ denotes the number of samples in the input data set. Let $B_m = \{w_m^1, \dots, w_m^q\}$ denote an orthonormal auxiliary basis such that each datum can be written as a combination of the $w_m^h$ with coefficients $c_h^m$, and let $V_m = \{v_m^1, \dots, v_m^r\}$ denote the set of $r$ orthonormal right singular vectors of $M_m$. If $V_m$ contains a basis vector $w_m^h$, then

$$\sum_{f=1}^{r} (v_m^f)^T w_m^h \; v_m^f = w_m^h,$$

and we trivially fulfill the constraint

$$c_h^m \Big( \sum_{f=1}^{r} (v_m^f)^T w_m^h \; v_m^f - w_m^h \Big) = 0. \qquad (12)$$

However, due to $r$ being smaller than $q$ in practice, $V_m$ typically cannot include all vectors from $B_m$. Thus, if $V_m$ does not contain $w_m^h$, we have $(v_m^f)^T w_m^h = 0$ for every vector $v_m^f$ in $V_m$, which means

$$\sum_{f=1}^{r} (v_m^f)^T w_m^h \; v_m^f = 0. \qquad (13)$$

As a consequence, the constraint of equation 12 is only partially fulfilled:

$$c_h^m \Big( \sum_{f=1}^{r} (v_m^f)^T w_m^h \; v_m^f - w_m^h \Big) = -c_h^m w_m^h.$$

As the $w_m^h$ have unit length, the factors $c_h^m$ determine the contribution of a datum to the overall loss. A feature $w_m^h$ that appears multiple times in the input data will have a correspondingly larger factor $c_h^m$ and hence will more strongly contribute to $L_{RR}$. The $L_2$ formulation of equation 11 leads to the largest contributors being minimized most strongly, and hence the repeating features of the data, i.e., the dominant features, need to be represented in $V_m$ to minimize the loss. Interestingly, this argumentation holds when additional loss terms are present, e.g., a loss term for classification. In such a case, the factors $c_m$ will be skewed towards those components that fulfill the additional loss terms, i.e., favor basis vectors $w_m^h$ that contain information about those loss terms. This, e.g., leads to clear digit structures being embedded in the weight matrices for the MNIST example below.
In summary, to minimize $L_{RR}$, $V_m$ is driven towards containing $r$ orthogonal vectors $w_m^h$ which represent the most frequent features of the input data, i.e., the dominant features. It is worth emphasizing that $B_m$ above is only an auxiliary basis, i.e., the derivation does not depend on any particular choice of $B_m$.
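To make the argument concrete, consider a minimal instance with our own illustrative numbers (not taken from the paper):

```latex
\text{Let } q = 2,\; r = 1,\; B_m = \{e_1, e_2\},\;
d_m = 3\,e_1 + 1\,e_2 \quad (c_1 = 3,\; c_2 = 1).
\\[4pt]
V_m = \{e_1\}: \quad \lVert -c_2\, e_2 \rVert_2^2 = 1,
\qquad
V_m = \{e_2\}: \quad \lVert -c_1\, e_1 \rVert_2^2 = 9,
\\[4pt]
\text{so minimizing } L_{RR} \text{ selects } V_m = \{e_1\},
\text{ the dominant feature direction.}
```

With only one singular vector available, the quadratic loss forces it onto the direction with the largest coefficient, exactly as in the general derivation above.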

A.3 EXAMPLES OF NETWORK ARCHITECTURES WITH PRETRAINING

While the proposed pretraining is considerably easier to integrate into training pipelines than classic autoencoder pretraining, there are subtleties w.r.t. the order of the operations in the reverse pass that we clarify with examples in the following sections. To specify NN architectures, we use the following notation: C(k, l, q) and D(k, l, q) denote convolutional and deconvolutional operations, respectively, while fully connected layers are denoted by F(l); here k, l, q denote kernel size, output channels, and stride size, respectively. The bias of a CNN layer is denoted by b. I/O(z) denote input/output, with dimensionality z. I_r denotes the input of the reverse pass network. tanh, relu, and lrelu denote hyperbolic tangent, ReLU, and leaky ReLU activation functions (AF), where we typically use a slope of 0.2 for the negative half-space. UP, MP, and BN denote 2× nearest-neighbor up-sampling, max pooling with 2 × 2 filters and stride 2, and batch normalization, respectively.

Below we provide additional examples of how to realize the pretraining loss

$$L_{RR} = \sum_{m=1}^{n} \lambda_m \| d_m - \tilde{d}_m \|_F^2 \qquad (22)$$

in a neural network architecture. As explained in the main document, the constraint of equation 11 is formulated via all layers: using $\tilde{d}_{m+1}$ from the next layer yields the full loss variant, while using the forward datum $d_{m+1}$ instead yields a variant that ensures local reversibility of each layer, and yields a very similar performance, as we will demonstrate below. We employ this local loss for networks without a unique, i.e., bijective, connection between two layers; intuitively, whenever inputs cannot be reliably reconstructed from outputs.

Full Network Pretraining: An illustration of a CNN structure with AF and BN and a full loss is shown in figure 10. While the construction of the reverse pass is straightforward for all standard operations, i.e., fully connected layers, convolutions, pooling, etc., slight adjustments are necessary for AF and BN.
It is crucial for our formulation that $d_m$ and $\tilde{d}_m$ contain the same latent space content in terms of range and dimensionality, such that they can be compared in the loss. Hence, we use the BN parameters and the AF of layer m-1 from the forward pass for layer m in the reverse pass. An example is shown in figure 14. To illustrate this setup, we consider an example network employing convolutions with mixed AFs, BN, and MP. Let the network receive a field of 32² scalar values as input. From this input, 20, 40, and 60 feature maps are extracted in the first three layers, while the kernel sizes are decreased from 5 × 5 to 3 × 3. To clarify the structure, we use ReLU activation for the first convolution, while the second one uses a hyperbolic tangent, and the third one a sigmoid function. With the notation outlined above, the first three layers of the network are

I(32, 32, 1) = d_1 → C_1(5, 20, 1) + b_1 → BN_1 → relu → d_2 → MP → C_2(4, 40, 1) + b_2 → BN_2 → tanh → d_3 → MP → C_3(3, 60, 1) + b_3 → BN_3 → sigm → d_4 → ... (23)

The reverse pass for evaluating the loss re-uses all weights of the forward pass and ensures that all intermediate vectors of activations, $d_m$ and $\tilde{d}_m$, have the same size and content in terms of normalization and non-linearity. We always consider states after activation for $L_{RR}$. Thus, $d_m$ denotes activations before pooling in the forward pass, and $\tilde{d}_m$ contains data after up-sampling in the reverse pass, in order to ensure matching dimensionality. Thus, the last three layers of the reverse network for computing $L_{RR}$ take the form

... → $\tilde{d}_4$ → -b_3 → D_3(3, 40, 1) → BN_2 → tanh → UP → $\tilde{d}_3$ → -b_2 → D_2(4, 20, 1) → BN_1 → relu → UP → $\tilde{d}_2$ → -b_1 → D_1(5, 1, 1) → $\tilde{d}_1$ = O(32, 32, 1). (24)

Here, the de-convolutions D_x in the reverse network share weights with C_x in the forward network, i.e., the 4 × 4 × 20 × 40 weight matrix of C_2 is reused in its transposed form as a 4 × 4 × 40 × 20 matrix in D_2.
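The activation-shifting rule of the reverse pass can be sketched with fully connected layers standing in for the convolutions (our own illustrative NumPy; shapes and activations are arbitrary stand-ins, not the example architecture above):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
identity = lambda x: x

rng = np.random.default_rng(3)
# three layers standing in for the convolutions of the example
Ms = [rng.normal(size=s) for s in [(5, 4), (6, 5), (6, 6)]]
acts = [relu, np.tanh, identity]      # per-layer forward activations

def forward_states(d):
    states = [d]
    for M, f in zip(Ms, acts):
        d = f(M @ d)
        states.append(d)
    return states

def reverse_states(d_out):
    """Transposed pass with the ordering rule from the text: the reverse step
    for layer m applies the activation of layer m-1, since that activation is
    what produced the latent content d_m being reconstructed."""
    shifted = [identity] + acts[:-1]  # shift activations down by one layer
    d, states = d_out, [d_out]
    for M, f in zip(reversed(Ms), reversed(shifted)):
        d = f(M.T @ d)
        states.append(d)
    return states[::-1]

fwd = forward_states(np.ones(4))
rev = reverse_states(fwd[-1])
# every reverse state matches its forward counterpart in dimensionality,
# so every L2 term of L_RR is well defined
assert [a.shape for a in fwd] == [b.shape for b in rev]
```

The shift mirrors the listing in equations 23 and 24: the AF (and BN) of the last forward layer never appears in the reverse pass, because it belongs to the latent space being fed in, not to the one being reconstructed.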
Additionally, it becomes apparent that AF and BN of layer 3 from the forward pass do not appear in the listing of the three last layers of the reverse pass. This is caused by the fact that both are required to establish the latent space of the fourth layer. Instead, d_3 in our example represents the activations after the second layer (with BN_2 and tanh), and hence the reverse pass for d̃_3 reuses both functions. This ensures that d_m and d̃_m contain the same latent space content in terms of range and dimensionality, and can be compared in equation 22. For the reverse pass, we additionally found it beneficial to employ an AF for the very last layer if the output space has suitable content. E.g., for inputs in the form of RGB data we employ an additional ReLU activation for the output to ensure the network generates only positive values. Localized Pretraining: In the example above, we use a full pretraining with d̃_{m+1} to reconstruct the activations d_m. The full structure establishes a slightly stronger relationship among the loss terms of different layers, and allows earlier layers to decrease the accumulated loss of later layers. However, if the architecture of the original network makes use of operations between layers that are not bijective, we instead use the local loss. E.g., this happens for residual connections with an addition, or for non-invertible pooling operations such as max-pooling. In the former case, we cannot uniquely determine b and c in a = b + c given only a. And unless special care is taken (Bruna et al., 2013), the source neuron of an output is not known for regular max-pooling operations. Note that our loss formulation has no problems with irreversible operations within a layer, e.g., most convolutional or fully-connected layers are typically not fully invertible. In all these cases the loss will drive the network towards a state that is as-invertible-as-possible for the given input data set.
However, this requires a reliable vector of target activations in order to apply the constraints. If the connection between layers is not bijective, we cannot reconstruct this target for the constraints, as in the examples given above. In such cases, we regard every layer as an individual unit to which we apply the constraints by building a localized reverse pass. For example, given a simple convolutional architecture with d_1 → C_1(5, 20, 1) + b_1 = d_2 (25) in the forward pass, we calculate d̃_1 with (d_2 − b_1) → D_1(5, 3, 1) = d̃_1. Overall, our experiments confirm the motivation of our pretraining formulation. They additionally show that employing an SVD of the network weights after our pretraining yields a simple and convenient method to give humans intuition about the features learned by a network.
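A localized reverse pass for a single layer can be sketched as follows. For illustration, the layer is written as an explicit matrix M (any strided convolution can be expressed this way); the sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# One layer, written as a matrix: 32-dim input -> 20-dim output.
M = rng.normal(size=(20, 32)) * 0.1
b = rng.normal(size=20) * 0.1

d1 = rng.normal(size=32)
d2 = M @ d1 + b        # forward:  d1 -> C1 + b1 = d2  (eq. 25)
d1_r = M.T @ (d2 - b)  # reverse: (d2 - b1) -> D1 = d~1

# The localized loss constrains only this one layer's in-/output
# pair, independently of the rest of the network.
local_loss = np.sum((d1 - d1_r) ** 2)
print(local_loss)
```

Each layer contributes one such term, so non-bijective operations between layers never enter the reconstruction path.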

B MUTUAL INFORMATION

This gives details of the mutual information and disentangled representation tests from Sec. 4 of the main paper. Training Details: We use the same numerical studies as in (Shwartz-Ziv & Tishby, 2017) as task A, i.e. a regular feed-forward neural network with 6 fully-connected layers. The input variable X contains 12 binary digits that represent 12 uniformly distributed points on a 2D sphere. The learning objective is to discover binary decision rules which are invariant under O(3) rotations of the sphere. X has 4096 different patterns, which are divided into 64 disjoint orbits of the rotation group, forming a minimal sufficient partition for spherically symmetric rules (Kazhdan et al., 2003). To generate the input-output distribution P(X, Y), we apply the stochastic rule p(y = 1|x) = Ψ(f(x) − θ), (x ∈ X, y ∈ Y), where Ψ is a standard sigmoidal function Ψ(u) = 1/(1 + exp(−γu)), following (Shwartz-Ziv & Tishby, 2017). We then use a spherically symmetric real-valued function of the pattern, f(x), evaluated through its spherical harmonics power spectrum (Kazhdan et al., 2003), and compare with a threshold θ which was selected to make p(y = 1) = Σ_x p(y = 1|x) p(x) ≈ 0.5, with uniform p(x). γ is high enough to keep the mutual information I(X; Y) ≈ 0.99 bits. For the transfer learning task B, we reverse the output labels to check whether the model learned specific or generalizing features. E.g., if the output is [0, 1] in the original data set, we swap the entries to [1, 0]. 80% of the data (3277 data pairs) are used for training and the rest (819 data pairs) are used for testing. For the MI comparison in Fig. 4 of the main paper, we discuss models before and after fine-tuning separately, in order to illustrate the effects of regularization. We include a model with greedy layer-wise pretraining Pre, a regular model Std_A, one with orthogonality constraints Ort_A, and our regular model RR_A, all before fine-tuning.
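The label-generation rule can be sketched as follows. The score f(x) here is a random placeholder for the spherical harmonics power spectrum of the original setup, and the value of γ is a hypothetical choice that makes the rule nearly deterministic:

```python
import numpy as np

rng = np.random.default_rng(3)

def psi(u, gamma=20.0):
    # High-gain sigmoid Psi(u) = 1 / (1 + exp(-gamma * u)).
    return 1.0 / (1.0 + np.exp(-gamma * u))

# Placeholder score per pattern, standing in for the spherically
# symmetric function f(x) of the paper.
f = rng.normal(size=4096)
theta = np.median(f)            # threshold so that p(y=1) ~ 0.5

p_y1 = psi(f - theta)           # stochastic rule p(y=1|x)
y = (rng.uniform(size=f.shape) < p_y1).astype(int)
print(round(y.mean(), 2))
```

With the threshold at the median of the scores, roughly half of the sampled labels come out positive, matching the balanced p(y = 1) ≈ 0.5 of the benchmark.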
For the model RR_A all layers are constrained to be recovered in the backward pass. We additionally include the version RR^1_A, i.e. a model trained with only one loss term λ_1 ‖d_1 − d̃_1‖², which means that only the input is constrained to be recovered. Thus, RR^1_A represents a simplified version of our approach which receives no constraints that intermediate results of the forward and backward pass should match. For Ort_A, we used the Spectral Restricted Isometry Property (SRIP) regularization (Bansal et al., 2018), L_SRIP = β σ(WᵀW − I), where W is the kernel, I denotes an identity matrix, and β represents the regularization coefficient. σ(W) = sup_{z ∈ ℝⁿ, z ≠ 0} ‖Wz‖ / ‖z‖ denotes the spectral norm of W. As explained in the main text, all layers of the first stage, i.e. from RR_A, RR

Analysis of Results:

We first compare the version constraining only input and output reconstruction (RR^1_A) and the full loss version RR_A. Fig. 4 (b) of the main paper shows that all points of RR_A are located in a central region of the MI plane, which means that all layers successfully encode information about the inputs as well as the outputs. This also indicates that every layer contains a similar amount of information about X and Y, and that the path from input to output is similar to the path from output to input. The points of RR^1_A, on the other hand, form a diagonal line, i.e., this network has different amounts of mutual information across its layers, and potentially a very different path for each direction. This difference in behavior is caused by the different constraints of the two versions: RR^1_A is only constrained to be able to regenerate its input, while the full loss for RR_A ensures that the network learns features which are beneficial for both directions. This test highlights the importance of applying the constraints throughout the depth of a network in our formulation. In contrast, the I(X; D) values of later layers for Std_A and Ort_A are small (points near the left side), while I(D; Y) is high throughout. This indicates that the outputs were successfully encoded, and that increasing amounts of information about the inputs are discarded. Hence, more specific features for the given output data set are learned by Std_A and Ort_A. This shows that both models are highly specialized for the given task, and potentially perform worse when applied to new tasks. Pre_A only focuses on decreasing the reconstruction loss, which results in high I(X; D) values for early layers, and low I(D; Y) values for later layers. During the fine-tuning phase for task A (i.e. with regularizers disabled), all models focus on the output and maximize I(D; Y).
There are differences in the distributions of the points along the y-axis, i.e., how much MI with the output is retained, as shown in Fig. 4 (c) of the main paper. For model RR_AA, the I(D; Y) value is higher than for Std_AA, Ort_AA, Pre_AA and RR^1_AA, which means the outputs of RR_AA are more closely related to the ground truth labels for task A. Thus, RR_AA outperforms the other variants for the original task. In the fine-tuning phase for task B, Std_AB stands out with very low accuracy in Fig. 5 of the main paper. This model from a regular training run has large difficulties adapting to the new task. Pre_A aims at extracting features from the inputs and reconstructing them. Pre_AB outperforms Std_AB, which means features helpful for task B are extracted by Pre_A; however, this feature extraction process is hard to guide. Model Ort_AB also performs worse than Std_B. RR_AB shows the best performance in this setting, demonstrating that our loss formulation yields more generic features, improving the performance for related tasks such as the inverted outputs of task B. We also analyze the two variants of our pretraining, the local variant lRR_A and the full version RR_A, in terms of mutual information. Figure 15 shows the MI planes for these two models, with RR^1_A included for comparison. Despite its local nature, lRR_A manages to establish MI for the majority of the layers, as indicated by the cluster of layers in the center of the MI plane. Only the first layer moves towards the top right corner, and the second layer is affected slightly. I.e., these layers exhibit a stronger relationship with the distribution of the outputs. Despite this, the overall performance when fine-tuning or for the task transfer remains largely unaffected, e.g., lRR_A still clearly outperforms RR^1_A. This confirms our choice to use the full pretraining when the network connectivity permits, and to employ the local version in all other cases.

B.2 DISENTANGLED REPRESENTATIONS

The InfoGAN approach (Chen et al., 2016) demonstrated the possibility to control the output of generative models via maximizing mutual information between outputs and structured latent variables. However, mutual information is very hard to estimate in practice (Walters-Williams & Li, 2009). The previous section and Fig. 4 (b) of the main paper demonstrated that models from our pretraining (both RR^1_A and RR_A) can increase the mutual information between network inputs and outputs. Intuitively, the pretraining explicitly constrains the model to recover an input given an output, which directly translates into an increase of mutual information between input and output distributions compared to regular training runs. To highlight how our pretraining can yield disentangled representations (as discussed in the later paragraphs of Sec. 4 of the main text), we follow the experimental setup of InfoGAN (Chen et al., 2016): the input dimension of our network is 74, containing a ten-dimensional category code c_1, two continuous latent codes c_2, c_3 ∼ U(−1, 1), and 62 noise variables. Here, U denotes a uniform distribution. Training Details: As InfoGAN focuses on structuring latent variables and thus only increases the mutual information between latent variables and the output, we also focus the pretraining on the corresponding latent variables. I.e., the goal is to maximize their mutual information with the output of the generative model. Hence, we train a model RR^1 for which only the latent dimensions c_1, c_2, c_3 of the input layer are involved in the loss. We still employ a full reverse pass structure in the neural network architecture. c_1 is a ten-dimensional category code, which is used for controlling the output digit category, while c_2 and c_3 are continuous latent codes that represent (previously unknown) key properties of the digits, such as orientation or thickness.
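The 74-dimensional latent input described above can be assembled as follows; the Gaussian choice for the unstructured noise is an assumption, since the source does not specify its distribution:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_latent(batch=5):
    # c1: ten-dimensional one-hot category code.
    c1 = np.eye(10)[rng.integers(0, 10, size=batch)]
    # c2, c3: continuous codes drawn from U(-1, 1).
    c23 = rng.uniform(-1.0, 1.0, size=(batch, 2))
    # 62 unstructured noise variables (Gaussian here; assumption).
    z = rng.normal(size=(batch, 62))
    return np.concatenate([c1, c23, z], axis=1)  # 10 + 2 + 62 = 74

print(sample_latent().shape)  # (5, 74)
```

Only the first 12 dimensions (c_1, c_2, c_3) enter the pretraining loss of the RR^1 model; the noise dimensions remain unconstrained.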
Building a relationship between c_1 and the outputs is more difficult than for c_2 or c_3, since the 10 different digit outputs need to be encoded in the single code c_1. Thus, for the corresponding loss term for c_1 we use a slightly larger λ factor (by 33%) than for c_2 and c_3. Details of our results are shown in figure 16. Models are trained using a GAN loss (Goodfellow et al., 2014) as the loss function for the outputs.

Analysis of Results:

In figure 16 we show additional results for the disentangling test case. It is visible that our pretraining of the RR^1 model yields distinct and meaningful latent space dimensions for c_1,2,3. While c_1 controls the digit, c_2,3 control the style and orientation of the digits. For comparison, a regular training run with the model Std does not result in meaningful or visible changes when adjusting the latent space dimensions. This illustrates how strongly the pretraining can shape the learned latent space.

C.2 GENERATIVE ADVERSARIAL MODELS

Training Details: The smoke simulation data set was generated with a Navier-Stokes solver from an open-source library (Thuerey & Pfaff, 2018). We generated 20 randomized simulations with 120 frames each, with 10% of the data being used for training. The low-resolution data were down-sampled from the high-resolution data by a factor of 4. Data augmentation, such as flipping and rotation, was used in addition. As outlined in the main text, we consider building an autoencoder model for the synthetic data as task B_1, and generating samples from a real-world smoke data set as task B_2. The smoke capture data set for B_2 contains 2500 smoke images from the ScalarFlow data set (Eckert et al., 2019), and we again used 10% of these images as training data. Task A: We use a fully convolutional CNN-based architecture for the generator and discriminator networks. Note that the inputs of the discriminator contain high-resolution data (64, 64, 1) as well as low-resolution data (16, 16, 1), which is up-sampled to (64, 64, 1) and concatenated with the high-resolution data. RR_A and Std_A are trained with a GAN loss in line with previous work (Xie et al., 2018). It is worth noting that the reverse pass of the generator is also constrained when training Pre_A and RR_A. Hence, both encoder and decoder are initialized with parameters from Pre_A and RR_A when training Pre_AB1 and RR_AB1, respectively. This is not possible for a regular network like Std_AB1, as the weights obtained with a normal training run are not suitable to be transposed. Hence, the deconvolutions of Std_AB1 are initialized randomly.
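The discriminator input described above, i.e., nearest-neighbor up-sampling of the low-resolution frame followed by channel-wise concatenation, can be sketched as:

```python
import numpy as np

def disc_input(hi, lo):
    # hi: (64, 64, 1) high-resolution frame.
    # lo: (16, 16, 1) low-resolution frame, up-sampled 4x with
    # nearest-neighbor repetition, then concatenated with the
    # high-resolution data along the channel axis.
    lo_up = np.repeat(np.repeat(lo, 4, axis=0), 4, axis=1)
    return np.concatenate([hi, lo_up], axis=2)

x = disc_input(np.zeros((64, 64, 1)), np.zeros((16, 16, 1)))
print(x.shape)  # (64, 64, 2)
```

This gives the discriminator a two-channel view in which the conditioning low-resolution input and the candidate high-resolution output are spatially aligned.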
Task B_2: As the data set for task B_2 is substantially different and contains RGB images (instead of single-channel gray-scale images), we choose the following setups for the RR_A, Pre_A and Std_A models: parameters from all six layers of Std_A and RR_A are reused for initializing the decoder part of Std_AB2 and RR_AB2, and parameters from all six layers of Pre_A are reused for initializing the encoder part of Pre_AB2. Specifically, when initializing the last layer of Pre_AB2, Std_AB2 and RR_AB2, we copy and stack the parameters from the last layer of Pre_A, Std_A and RR_A, respectively, into three channels to match the dimensions of the outputs for task B_2. Here, the encoder part of RR_AB2 and the decoder of Pre_AB2 are not initialized from RR_A and Pre_A, due to the significant gap between the training data sets of task B_1 and task B_2. Our experiments show that only initializing the decoder part of RR_AB2 (avg. loss: 1.56e7, std. dev.: 3.81e5) outperforms initializing both encoder and decoder (avg. loss: 1.82e7 ± 2.07e6), and that only initializing the encoder part of Pre_AB2 (avg. loss: 4.41e7 ± 6.36e6) outperforms initializing both encoder and decoder (avg. loss: 9.42e7 ± 6.11e7). We believe the reason is that initializing both encoder and decoder makes it more difficult to adjust the parameters to a new data set that is very different from the data set of the source task. Analysis of Results: Example outputs of Pre_AB1, Std_AB1 and RR_AB1 are shown in figure 18. It is clearly visible that RR_AB1 gives the best performance among these models. We similarly illustrate the behavior of the transfer learning task B_2 for images of real-world fluids. This example



Figure1: Our pretraining (denoted as RR) yields improvements for numerous applications: a): For difficult shape classification tasks, it outperforms existing approaches (StdTS, OrtTS, PreTS): the RRTS model classifies the airplane shape with significantly higher confidence. b): Our approach establishes mutual information between input and output distributions. c): For CIFAR 10 classification with a Resnet110, RRC10 yields substantial practical improvements over the state-of-the-art. d): Learned weather forecasting has strictly limited real-world data: our pretraining yields improvements for pressure (Z500, zoomed in regions shown above), atmospheric temperature (T850) as well as ground temperature (T2M).

Figure 2: A visual overview of the regular forward pass (blue) and the corresponding reverse pass for pretraining (yellow).

Figure 3: Column vectors of Vm for different trained models Std, Ort, Pre and RR for peaks. Input features clearly are successfully embedded in the weights of RR, as confirmed by the LPIPS scores. We employ a training data set constructed from two dominant classes (a peak in the top left, and bottom right quadrant, respectively), augmented with noise in the form of random scribbles. Based on the analysis above, we expect the RR training to extract the two dominant peaks during training. The LPIPS measurements confirm our SVD argumentation above, with average scores of 0.217 ± 0.022 for RR, 0.319 ± 0.114 for Pre, 0.495 ± 0.006 for Ort, and 0.500 ± 0.002 for Std. I.e., the RR model fares significantly better than the others. At the same time, the peaks are clearly visible for RR models, an example is shown in figure 3(b), while the other models fail to extract structures that resemble the input. Thus, by training with the full network and the original training objective, our pretraining yields structures that are interpretable and can be inspected by humans.

More formally, the MI I(X; Y) of random variables X and Y measures how different the joint distribution of X and Y is w.r.t. the product of their marginal distributions, i.e., the Kullback-Leibler divergence I(X; Y) = D_KL[P_(X,Y) ‖ P_X P_Y]. (Tishby & Zaslavsky, 2015) proposed the MI plane to analyze trained models, which shows the MI between the input X and the activations of a layer D_m, i.e., I(X; D_m), and I(D_m; Y), i.e., the MI of layer D_m with the output Y. These two quantities indicate how much information about the in- and output distributions is retained at each layer, and we use them to show to which extent our pretraining succeeds at incorporating information about the inputs throughout training.

Figure 4: MI planes for different models: a) Visual overview of the contents. b) Plane for task A. Points on each line correspond to the layers of one model. All points of RR_A are located in the center of the graph, while Std_A and Ort_A exhibit large I(D_m; Y), i.e., specialize on the output. Pre_A strongly focuses on reconstructing the input, with high I(X; D_m) for early layers. c,d): After fine-tuning for A/B. The last layer D_7 of RR_AA and RR_AB successfully builds the strongest relationship with Y, yielding the highest accuracy.

Figure 5: Performance for MI source and transfer tasks for the models of figure 4. Due to the large standard deviation of Ort, we show min/max value ranges. The dashed gray line and region show baseline accuracy for StdB. The top-left inset highlights the stability of the high accuracy results from RR training.

Figure 6: (a) Examples from texture-shape data set. (b, c, d) Texture-shape test accuracy comparisons of PreTS, OrtTS, StdTS and RR TS for different data sets.

Figure 9: Details of the three physical quantities of the weather forecasting test (full frames are shown in the appendix). As confirmed by the quantified results, RR predicts results closer to the reference.

with d_m and λ_m denoting the vector of activated intermediate data in layer m from the forward pass, and a scaling factor, respectively. d̃_m denotes the activations of layer m from the reverse pass. E.g., let L_m() denote the operations of a layer m in the forward pass, and L̃_m() the corresponding operations for the reverse pass. Then d_{m+1} = L_m(d_m), and d̃_m = L̃_m(d̃_{m+1}). When equation 22 is minimized, we obtain activated intermediate content during the reverse pass that reconstructs the values computed in the forward pass, i.e. d̃_{m+1} = d_{m+1} holds. Then d̃_m can be reconstructed either from the incoming activations of the reverse pass, i.e., d̃_{m+1}, or from the output of layer m, i.e., d_{m+1}. Using d̃_{m+1} results in a global coupling of input and output throughout the network.

Figure 10: Left: An overview of the regular forward pass (blue) and the corresponding reverse pass (yellow). The right side illustrates how parameters are reused for a convolutional layer. conv/deconv denote convolution/deconvolutional operations. fm and BNm denote the activation function and batch normalization of layer m, respectively. Shared kernel and bias are represented by Mm and bm.

Figure 14: Right singular vectors of M1 for peak tests with different network architectures. Across the three architectures, RR A successfully extracts dominant and salient features.

Figure 11: SVD of the M1 matrix for five tests with random two-digit images as training data. LPIPS distances (Zhang et al., 2018b) of RR are consistently lower than those of Std and Ort.

Figure 13: Five repeated tests with the peak data shown in Sec. 3 of the main paper. RR A robustly extracts dominant features from the data set. The two singular vectors strongly resemble the two peak modes of the training data. This is confirmed by the LPIPS measurements.

MUTUAL INFORMATION TEST

Mutual information (MI) measures the dependence of two random variables, i.e., higher MI means that there is more shared information between the two quantities. More formally, the mutual information I(X; Y) of random variables X and Y measures how different the joint distribution of X and Y is w.r.t. the product of their marginal distributions, i.e., the Kullback-Leibler divergence I(X; Y) = D_KL[P_(X,Y) ‖ P_X P_Y]. Let I(X; D_m) denote the mutual information between the activations of a layer D_m and the input X. Similarly, I(D_m; Y) denotes the MI between layer m and the output Y. We use MI planes in the main paper, which show I(X; D_m) and I(D_m; Y) in a 2D graph for the activations of each layer D_m of a network after training. This visualizes how much information about the input and output distributions is retained at each layer, and how these relationships change within the network. For regular training, the information bottleneck principle (Tishby & Zaslavsky, 2015) states that early layers contain more information about the input, i.e., show high values for I(X; D_m) and I(D_m; Y). Hence, in the MI plane visualizations these layers are often visible in the top-right corner. Later layers typically share a large amount of information with the output after training, i.e. show large I(D_m; Y) values, and correlate less with the input (low I(X; D_m)). Thus, they typically show up in the top-left corner of the MI plane graphs.
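For discrete variables with a known joint distribution, the MI above reduces to a direct sum over the joint probability table; a minimal sketch:

```python
import numpy as np

def mutual_information(p_xy):
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    px = p_xy.sum(axis=1, keepdims=True)   # marginal P(X)
    py = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y)
    mask = p_xy > 0                        # 0 * log 0 := 0
    return float(np.sum(p_xy[mask] *
                        np.log2(p_xy[mask] / (px * py)[mask])))

# Perfectly correlated binary variables share one full bit,
# while independent ones share none.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # 1.0
print(mutual_information(np.full((2, 2), 0.25)))               # 0.0
```

Estimating I(X; D_m) for continuous activations additionally requires binning or another estimator; this sketch only illustrates the underlying definition.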

Figure 15: (a-c) MI plane comparisons for local (lRR_A) versus full models (RR_A). Points on each line correspond to the layers of one model. a) MI plane for task A. All points of RR_A and the majority of points of lRR_A (five out of seven) are located in the center of the graph, i.e., successfully connect the in- and output distributions. b,c): After fine-tuning for A/B. The last layer D_7 of RR_AA builds the strongest relationship with Y. I(D_7; Y) of lRR_A is only slightly lower than for RR_AA. d): Accuracy comparisons among different models: RR_AA yields the highest performance, while lRR_A performs similarly to RR_AA.

Figure 16: Additional results for the disentangled representations with the MNIST data: For every row in the figures, we vary the corresponding latent code (left to right), while keeping all other inputs constant. Different rows indicate a different random noise input. For example, in (b): every column contains five results which are generated with different noise samples, but the same latent codes c1∼3. In every row, 10 results are generated with 10 different values of c1, which correspond to one digit each for (b). (a): For a regular training (Std), no clear correspondence between c1 and the outputs is apparent (similarly for c2,3). (c): Different c2 values result in a tweaked style, while c3 controls the orientation of the digit, as shown in (d). Thus, in contrast to Std, the pretrained model learns a meaningful, disentangled representation.

Figure 18: Example outputs for PreAB 1 , StdAB 1 , RR AB 1 . The reference is shown for comparison. RR AB 1 produces higher quality results than StdAB 1 and PreAB 1 .

LPIPS distances are measured between v^i_m and the training data (lower values being better).

1_A, Ort_A, Pre_A and Std_A are reused for training the fine-tuned models without regularization, i.e. RR_AA, RR^1_AA, Ort_AA, Pre_AA and Std_AA. Likewise, all layers of the transfer task models RR_AB, RR^1_AB, Ort_AB, Pre_AB and Std_AB are initialized from the models of the first training stage.
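The SRIP orthogonality regularizer used for the Ort models can be sketched with a simple power-iteration estimate of the spectral norm; the matrix sizes, β, and iteration count here are illustrative choices, not values from the paper:

```python
import numpy as np

def srip_penalty(W, beta=1e-4, iters=20):
    # L_SRIP = beta * sigma(W^T W - I), where sigma(.) is the
    # spectral norm, estimated via power iteration on the
    # symmetric matrix A = W^T W - I.
    A = W.T @ W - np.eye(W.shape[1])
    v = np.random.default_rng(4).normal(size=W.shape[1])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v) + 1e-12  # guard against A ~ 0
    return beta * abs(v @ (A @ v))

# A matrix with orthonormal columns (from QR) incurs almost no
# penalty, while a plain random matrix is penalized.
W_orth = np.linalg.qr(np.random.default_rng(5).normal(size=(16, 8)))[0]
W_rand = np.random.default_rng(6).normal(size=(16, 8))
print(srip_penalty(W_orth), srip_penalty(W_rand))
```

In a training loop, this penalty would be added per-layer to the task loss; the original SRIP work computes the spectral norm in essentially the same power-iteration fashion for efficiency.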


likewise uses an autoencoder structure. Visual comparisons are provided in figure 19 , where RR AB2 generates results that are closer to the reference. Overall, these results demonstrate the benefits of our pretraining for GANs, and indicate its potential to obtain more generic features from synthetic data sets that can be used for tasks involving real-world data.

C.3 WEATHER FORECASTING

Training Details: The weather forecasting scenario discussed in the main text follows the methodology of the WeatherBench benchmark (Rasp et al., 2020). This benchmark contains 40 years of data from the ERA5 reanalysis project (Hersbach et al., 2020), re-sampled to a 5.625° resolution. The neural networks for the forecasting tasks employ a ResNet architecture with 19 layers, all of which contain 128 features with 3 × 3 kernels (apart from 7 × 7 in the first layer). All layers use batch normalization, leaky ReLU activation (tangent 0.3), and dropout with strength 0.1. As inputs, the model receives feature-wise concatenated data from the WeatherBench data for 3 consecutive time steps, i.e., t, t − 6h, and t − 12h, yielding 117 channels in total. The last convolution jointly generates all three output fields, i.e., pressure at 500 hPa (Z500), temperature at 850 hPa (T850), and the 2-meter temperature (T2M).
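The feature-wise stacking of time steps can be sketched as follows; the split of 117 channels into 39 per step and the 32 × 64 lat-lon grid of the 5.625° resolution are assumptions inferred from the numbers above:

```python
import numpy as np

# One training sample: three consecutive time steps (t, t-6h,
# t-12h), each assumed to carry 39 feature channels on a 32x64
# lat-lon grid, concatenated feature-wise into 117 channels.
steps = [np.zeros((32, 64, 39)) for _ in range(3)]
model_input = np.concatenate(steps, axis=-1)
print(model_input.shape)  # (32, 64, 117)
```

The ResNet then maps this stacked tensor to the three target fields (Z500, T850, T2M) in a single forward pass.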

Analysis of Results:

In addition to the quantitative results for both years of test data given in the main text, figures 20 and 21 contain additional example visualizations from the test data set. A visualization of the spatial error distribution w.r.t. the ground truth is also shown in figure 21. It becomes apparent that our pretraining achieves reduced errors across the whole range of samples. Both temperature targets contain a larger number of small-scale features than the pressure fields. While the gains from our pretraining approach are not huge (on the order of 3% in both cases), they represent important steps forward: the learning objective is highly non-trivial, and the improvements were achieved with the same limited set of training data. Being very easy to integrate into existing training pipelines, these results indicate that the proposed pretraining methodology has the potential to yield improved learning results for a wide range of problem settings.

