ANYTIME SAMPLING FOR AUTOREGRESSIVE MODELS VIA ORDERED AUTOENCODING

Abstract

Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. Inspired by Principal Component Analysis, we learn a structured representation space where dimensions are ordered based on their importance with respect to reconstruction. Using an autoregressive model in this latent space, we trade off sample quality for computational efficiency by truncating the generation process before decoding into the original data space. Experimentally, we demonstrate in several image and audio generation tasks that sample quality degrades gracefully as we reduce the computational budget for sampling. The approach suffers almost no loss in sample quality (measured by FID) using only 60% to 80% of all latent dimensions for image data. Code is available at https://github.com/Newbeeer/Anytime-Auto-Regressive-Model.

1. INTRODUCTION

Autoregressive models are a prominent approach to data generation, and have been widely used to produce high quality samples of images (Oord et al., 2016b; Salimans et al., 2017; Menick & Kalchbrenner, 2018), audio (Oord et al., 2016a), video (Kalchbrenner et al., 2017) and text (Kalchbrenner et al., 2016; Radford et al., 2019). These models represent a joint distribution as a product of (simpler) conditionals, and sampling requires iterating over these conditional distributions in a fixed order. Due to the sequential nature of this process, the computational cost grows at least linearly with the number of conditional distributions, which is typically equal to the data dimension. As a result, the sampling process of autoregressive models is slow and does not allow interruptions. Although caching techniques have been developed to speed up generation (Ramachandran et al., 2017; Guo et al., 2017), the high cost of sampling limits the applicability of these models in many scenarios. For example, when running on multiple devices with different computational resources, we may wish to trade off sample quality for faster generation based on the computing power available on each device. Currently, a separate model must be trained for each device (i.e., each computational budget), and there is no way to control this trade-off on the fly to accommodate instantaneous resource availability at deployment time.

To address this difficulty, we consider the novel task of adaptive autoregressive generation under computational constraints. We seek to build a single model that can automatically trade off sample quality against computational cost via anytime sampling, i.e., sampling that may be interrupted at any time (e.g., because the computational budget is exhausted) to yield a complete sample whose quality decays gracefully with the earliness of termination.
In particular, we take advantage of a generalization of Principal Component Analysis (PCA) proposed by Rippel et al. (2014), which learns an ordered representation induced by a structured application of dropout to the representations learned by an autoencoder. Such a representation encodes raw data into a latent space whose dimensions are sorted by their importance for reconstruction. Autoregressive modeling is then applied in this ordered representation space instead of the original data space.

This approach enables a natural trade-off between quality and computation by truncating the length of the representations: when running on devices with high computational capacity, we can afford to generate the full representation and decode it to obtain a high quality sample; when on a tighter computational budget, we can generate only the first few dimensions of the representation and decode them to a sample whose quality degrades smoothly with truncation. Because decoding is usually fast and the main computational bottleneck lies in the autoregressive part, the run-time grows proportionally with the number of sampled latent dimensions.

Through experiments, we show that our autoregressive models are capable of trading off sample quality and inference speed. When training autoregressive models on the latent space given by our encoder, we observe little degradation in image sample quality using only around 60% to 80% of all latent codes, as measured by the Fréchet Inception Distance (Heusel et al., 2017) on CIFAR-10 and CelebA. Compared to standard autoregressive models, our approach allows the sample quality to degrade gracefully as we reduce the computational budget for sampling. We also observe that on the VCTK audio dataset (Veaux et al., 2017), our autoregressive model generates the low frequency features first, then gradually refines the waveforms with higher frequency components as we increase the number of sampled latent dimensions.
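To make the two mechanisms above concrete, the following is a minimal sketch (not the paper's actual implementation) of the structured-dropout mask used to induce an ordered latent space, and of truncating a latent code to a given budget before decoding. All function names are illustrative, and the decoder itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def ordered_dropout_mask(num_dims: int, rng) -> np.ndarray:
    """Sample a prefix mask that keeps dimensions 1..k and zeros the rest.

    Drawing the truncation point k at random during training means the
    leading latent dimensions are used far more often than the trailing
    ones, so the autoencoder is pushed to pack the most important
    information into the earliest dimensions (a simplified version of
    the nested dropout of Rippel et al., 2014).
    """
    k = rng.integers(1, num_dims + 1)  # random truncation point
    mask = np.zeros(num_dims)
    mask[:k] = 1.0
    return mask

def truncate_latent(z: np.ndarray, budget: int) -> np.ndarray:
    """Anytime sampling: keep only the first `budget` latent dimensions.

    The truncated code would then be passed to the decoder; with an
    ordered representation, sample quality degrades smoothly as
    `budget` shrinks.
    """
    z_trunc = np.zeros_like(z)
    z_trunc[:budget] = z[:budget]
    return z_trunc
```

At train time the mask multiplies the encoder output before decoding; at sampling time, an interrupted autoregressive model simply yields a short prefix, which `truncate_latent` zero-pads for the decoder.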

2. BACKGROUND

Autoregressive Models. Autoregressive models define a probability distribution over data points $x \in \mathbb{R}^D$ by factorizing the joint distribution as a product of univariate conditional distributions via the chain rule. Using $p_\theta$ to denote the model distribution, we have

$$p_\theta(x) = \prod_{i=1}^{D} p_\theta(x_i \mid x_1, \ldots, x_{i-1}).$$

The model is trained by maximizing the likelihood $\mathcal{L} = \mathbb{E}_{p_d(x)}[\log p_\theta(x)]$, where $p_d(x)$ denotes the data distribution. Different autoregressive models adopt different orderings of the input dimensions and parameterize the conditional probabilities $p_\theta(x_i \mid x_1, \ldots, x_{i-1})$, $i = 1, \ldots, D$, in different ways. Most architectures over images order the variables $x_1, \ldots, x_D$ of an image $x$ in raster scan order (i.e., left-to-right, then top-to-bottom). Popular autoregressive architectures include MADE (Germain et al., 2015), PixelCNN (Oord et al., 2016b; van den Oord et al., 2016; Salimans et al., 2017) and the Transformer (Vaswani et al., 2017), which respectively use masked linear layers, convolutional layers and self-attention blocks to ensure that the output corresponding to $p_\theta(x_i \mid x_1, \ldots, x_{i-1})$ is oblivious to $x_i, x_{i+1}, \ldots, x_D$.

Cost of Sampling. During training, we can evaluate autoregressive models efficiently because $x_1, \ldots, x_D$ are provided by the data and all conditionals $p_\theta(x_i \mid x_1, \ldots, x_{i-1})$ can be computed in parallel. In contrast, sampling from autoregressive models is an inherently sequential process and cannot be easily accelerated by parallel computing: we first need to sample $x_1$, after which we sample $x_2$ from $p_\theta(x_2 \mid x_1)$, and so on; the $i$-th variable $x_i$ can only be obtained after we have already computed $x_1, \ldots, x_{i-1}$. Thus, the run-time of autoregressive generation grows at least linearly with the length of a sample. In practice, the sample length $D$ can be more than hundreds of
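The asymmetry between parallel training-time evaluation and sequential sampling can be illustrated with a toy fully-visible autoregressive model over binary variables, where each conditional is a logistic function of the preceding dimensions. This is an illustrative sketch, not any of the architectures cited above; the strictly lower-triangular weight matrix plays the role of the masking used by MADE or PixelCNN.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
# Strictly lower-triangular weights: row i touches only x_1..x_{i-1},
# enforcing the autoregressive dependency structure.
W = rng.normal(size=(D, D)) * np.tri(D, k=-1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(rng):
    """Sequential sampling: x_i is drawn only after x_1..x_{i-1} are known."""
    x = np.zeros(D)
    for i in range(D):               # D sequential steps -> runtime grows with D
        p_i = sigmoid(W[i] @ x)      # p(x_i = 1 | x_<i)
        x[i] = rng.random() < p_i
    return x

def log_likelihood(x):
    """Training-time evaluation: all conditionals computed in parallel."""
    p = sigmoid(W @ x)               # one matrix product yields every p_i at once
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```

The loop in `sample` cannot be parallelized because each step consumes the previous outputs, whereas `log_likelihood` evaluates all $D$ conditionals in a single matrix product; this is exactly the gap that makes anytime truncation of the (much shorter) latent sequence attractive.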

