IMPROVED AUTOREGRESSIVE MODELING WITH DISTRIBUTION SMOOTHING

Abstract

While autoregressive models excel at image compression, their sample quality is often lacking. Generated images, although unrealistic, often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets.

1. INTRODUCTION

Autoregressive models have exhibited promising results in a variety of downstream tasks. For instance, they have shown success in compressing images (Minnen et al., 2018), synthesizing speech (Oord et al., 2016a) and modeling complex decision rules in games (Vinyals et al., 2019). However, the sample quality of autoregressive models on real-world image datasets is still lacking. Poor sample quality might be explained by the manifold hypothesis: many real-world data distributions (e.g. natural images) lie in the vicinity of a low-dimensional manifold (Belkin & Niyogi, 2003), leading to complicated densities with sharp transitions (i.e. high Lipschitz constants), which are known to be difficult for density models such as normalizing flows to capture (Cornish et al., 2019). Since each conditional of an autoregressive model is a 1-dimensional normalizing flow (given a fixed context of previous pixels), a high Lipschitz constant will likely hinder the learning of autoregressive models as well.

Another reason for poor sample quality is the "compounding error" issue in autoregressive modeling. An autoregressive model relies on the previously generated context to make a prediction; once a mistake is made, the model is likely to make further mistakes that compound (Kääriäinen, 2006), eventually resulting in questionable and unrealistic samples. Intuitively, one would expect the model to assign low likelihoods to such unrealistic images; however, this is not always the case. In fact, the generated samples, although appearing unrealistic, are often assigned high likelihoods by the autoregressive model, resembling an "adversarial example" (Szegedy et al., 2013; Biggio et al., 2013): an input that causes the model to output an incorrect answer with high confidence.

Inspired by the recent success of randomized smoothing techniques in adversarial defense (Cohen et al., 2019), we propose to apply randomized smoothing to autoregressive generative modeling. More specifically, we address the density estimation problem via a two-stage process. Unlike Cohen et al. (2019), who apply smoothing to the model to make it more robust, we apply smoothing to the data distribution: we convolve a symmetric and stationary noise distribution with the data distribution to obtain a new, "smoother" distribution. In the first stage, we model this smoothed version of the data distribution with an autoregressive model. In the second stage, we reverse the smoothing process (a procedure that can also be understood as "denoising"), either by applying a gradient-based denoising approach (Alain & Bengio, 2014) or by introducing another conditional autoregressive model that recovers the original data distribution from the smoothed one. By choosing an appropriate smoothing distribution, we aim to make each stage easier than the original learning problem: smoothing facilitates learning in the first stage by making the input distribution smoother and easier to model, while the second stage reduces to a comparatively simple denoising problem.

We show with extensive experimental results that our approach drastically improves the sample quality of current autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets. We empirically demonstrate that our method can also be applied to density estimation, image inpainting, and image denoising.
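As a concrete illustration of this two-stage idea, consider the following minimal, self-contained 1-D sketch. It is a toy illustration under simplifying assumptions, not the architecture used in our experiments: the smoothing level sigma, the single-Gaussian smoothed model, and the affine-mean Gaussian denoiser are all placeholders chosen only to keep the code short.

```python
# Toy sketch of the two-stage procedure (illustration only; sigma, the
# single-Gaussian smoothed model, and the affine-mean Gaussian denoiser
# are simplifying assumptions, not the actual setup).
import torch

sigma = 0.3                                          # assumed smoothing level
x = torch.cat([torch.zeros(500), torch.ones(500)])   # sharp, manifold-like data

# Stage-1 model p_theta(x_tilde): one Gaussian over the smoothed variable.
mu1 = torch.zeros(1, requires_grad=True)
ls1 = torch.zeros(1, requires_grad=True)
# Stage-2 model p_theta(x | x_tilde): Gaussian with mean affine in x_tilde.
w = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
ls2 = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu1, ls1, w, b, ls2], lr=1e-2)

for step in range(2000):
    x_tilde = x + sigma * torch.randn_like(x)        # q(x_tilde | x): inject noise
    p_smooth = torch.distributions.Normal(mu1, ls1.exp())
    p_denoise = torch.distributions.Normal(w * x_tilde + b, ls2.exp())
    loss = -(p_smooth.log_prob(x_tilde).mean()       # stage 1: fit smoothed data
             + p_denoise.log_prob(x).mean())         # stage 2: learn to denoise
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling reverses the process: draw x_tilde, then denoise it.
with torch.no_grad():
    xt = torch.distributions.Normal(mu1, ls1.exp()).sample((10,)).squeeze(-1)
    xs = torch.distributions.Normal(w * xt + b, ls2.exp()).sample()
```

In the actual method both stages are parameterized with autoregressive networks; the sketch only conveys the training objective and the two-step sampling flow.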

2. BACKGROUND

We consider a density estimation problem. Given $D$-dimensional i.i.d. samples $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ from a continuous data distribution $p_{\text{data}}(\mathbf{x})$, the goal is to approximate $p_{\text{data}}(\mathbf{x})$ with a model $p_\theta(\mathbf{x})$ parameterized by $\theta$. A commonly used approach for density estimation is maximum likelihood estimation (MLE), where the objective is to maximize $\mathcal{L}(\theta) \triangleq \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$.
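As a minimal concrete instance of MLE (our own toy example, not a model from this paper), the following fits a 1-D Gaussian by maximizing the average log-likelihood with gradient steps:

```python
# Minimal MLE sketch: fit a 1-D Gaussian p_theta(x) by maximizing the
# average log-likelihood L(theta) over i.i.d. samples.
import torch

data = torch.randn(1000) * 2.0 + 3.0           # i.i.d. samples from the "data distribution"
mu = torch.zeros(1, requires_grad=True)        # model parameters theta = (mu, log_sigma)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    loss = -dist.log_prob(data).mean()         # negative of L(theta); minimizing = MLE
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())       # should approach (3.0, 2.0)
```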

2.1. AUTOREGRESSIVE MODELS

An autoregressive model (Larochelle & Murray, 2011; Salimans et al., 2017) decomposes a joint distribution $p_\theta(\mathbf{x})$ into a product of univariate conditionals: $p_\theta(\mathbf{x}) = \prod_{i=1}^{D} p_\theta(x_i \mid \mathbf{x}_{<i})$, where $x_i$ stands for the $i$-th component of $\mathbf{x}$, and $\mathbf{x}_{<i}$ refers to the components with indices smaller than $i$. In general, an autoregressive model parameterizes each conditional $p_\theta(x_i \mid \mathbf{x}_{<i})$ using a prespecified density function (e.g. a mixture of logistics), which bounds the capacity of the model by limiting the number of modes of each conditional. Although autoregressive models have achieved top likelihoods among all types of density-based models, their sample quality still lags behind that of energy-based models (Du & Mordatch, 2019) and score-based models (Song & Ermon, 2019). We believe this can be attributed to the two reasons discussed below.
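To make the factorization concrete, here is a minimal autoregressive density in code. This is a toy sketch: it uses a single Gaussian per conditional rather than the mixture of logistics mentioned above, and a separate small network per dimension purely for clarity.

```python
# Toy autoregressive density sketch (an illustration, not the paper's model):
# each 1-D conditional p_theta(x_i | x_<i) is a Gaussian whose mean and
# log-std are predicted from the preceding components by a small MLP.
import torch
import torch.nn as nn

class ToyAR(nn.Module):
    def __init__(self, D):
        super().__init__()
        self.D = D
        # one conditional network per dimension (dimension 0 sees a dummy input)
        self.nets = nn.ModuleList(
            nn.Sequential(nn.Linear(max(i, 1), 32), nn.Tanh(), nn.Linear(32, 2))
            for i in range(D)
        )

    def log_prob(self, x):                       # x: (batch, D)
        total = 0.0
        for i in range(self.D):
            ctx = x[:, :i] if i > 0 else torch.zeros(x.size(0), 1)
            mean, log_std = self.nets[i](ctx).chunk(2, dim=-1)
            cond = torch.distributions.Normal(mean.squeeze(-1), log_std.exp().squeeze(-1))
            total = total + cond.log_prob(x[:, i])  # log p(x_i | x_<i)
        return total                              # sum over i = joint log-likelihood
```

Note that because each conditional has a fixed parametric form, its number of modes is bounded regardless of how flexible the context network is; this is exactly the capacity limitation mentioned above.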

2.2. MANIFOLD HYPOTHESIS

Several existing methods (Roweis & Saul, 2000; Tenenbaum et al., 2000) rely on the manifold hypothesis, i.e. that real-world high-dimensional data tends to lie on a low-dimensional manifold (Narayanan & Mitter, 2010). If the manifold hypothesis is true, then the density of the data distribution is not well defined in the ambient space; if the manifold hypothesis holds only approximately and the data lies in the vicinity of a manifold, then only points that are very close to the manifold have high density, while all other points have density close to zero. We may therefore expect the data density around the manifold to have large first-order derivatives, i.e. the density function has a high (if not infinite) Lipschitz constant. To see this, consider a 2-d example where the data distribution is a thin ring (almost a unit circle) formed by rotating the 1-d Gaussian distribution $\mathcal{N}(1, 0.01^2)$ around the origin. The density function of the ring has a high Lipschitz constant near its "boundary". Let us focus on a data point travelling along the diagonal, as shown in the leftmost panel of figure 2. We plot the first-order derivative of the density along this direction in figure 2.
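The ring example is easy to reproduce. The sketch below samples from the ring and probes the density along the diagonal; the radial density is left unnormalized (a simplification that ignores the polar-coordinate Jacobian), which does not affect the qualitative sharp-transition behavior.

```python
# Sketch of the 2-D "thin ring" example: radii drawn from N(1, 0.01^2),
# angles uniform, so the density concentrates near the unit circle and
# falls off sharply away from it.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
r = rng.normal(loc=1.0, scale=0.01, size=n)      # radius ~ N(1, 0.01^2)
theta = rng.uniform(0.0, 2 * np.pi, size=n)      # angle ~ Uniform[0, 2*pi)
samples = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

def ring_density(xy):
    """Unnormalized density: Gaussian in the radial coordinate."""
    rad = np.linalg.norm(xy, axis=-1)
    return np.exp(-0.5 * ((rad - 1.0) / 0.01) ** 2)

# Probe the density along the diagonal direction used in the text:
ts = np.linspace(0.0, 1.5, 7)
diag = np.stack([ts, ts], axis=1) / np.sqrt(2)   # points at distance t from origin
print(ring_density(diag))                        # near zero except around t = 1
```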



Figure 1: Overview of our method. From a data distribution $p_{\text{data}}(\mathbf{x})$ we inject noise via $q(\tilde{\mathbf{x}} \mid \mathbf{x})$, which produces a smoother distribution $q(\tilde{\mathbf{x}})$; we then model the smoothed distribution ($p_\theta(\tilde{\mathbf{x}})$) as well as the denoising step ($p_\theta(\mathbf{x} \mid \tilde{\mathbf{x}})$), forming a two-step model.

