IMPROVED AUTOREGRESSIVE MODELING WITH DISTRIBUTION SMOOTHING

Abstract

While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets while obtaining competitive likelihoods on synthetic datasets.

1. INTRODUCTION

Autoregressive models have exhibited promising results in a variety of downstream tasks. For instance, they have shown success in compressing images (Minnen et al., 2018), synthesizing speech (Oord et al., 2016a) and modeling complex decision rules in games (Vinyals et al., 2019). However, the sample quality of autoregressive models on real-world image datasets is still lacking. Poor sample quality might be explained by the manifold hypothesis: many real-world data distributions (e.g. natural images) lie in the vicinity of a low-dimensional manifold (Belkin & Niyogi, 2003), leading to complicated densities with sharp transitions (i.e. high Lipschitz constants), which are known to be difficult to model for density models such as normalizing flows (Cornish et al., 2019). Since each conditional of an autoregressive model is a 1-dimensional normalizing flow (given a fixed context of previous pixels), a high Lipschitz constant will likely hinder learning of autoregressive models. Another reason for poor sample quality is the "compounding error" issue in autoregressive modeling. An autoregressive model relies on the previously generated context to make a prediction; once a mistake is made, the model is likely to make another mistake, and these errors compound (Kääriäinen, 2006), eventually resulting in questionable and unrealistic samples. Intuitively, one would expect the model to assign low likelihoods to such unrealistic images; however, this is not always the case. In fact, the generated samples, although appearing unrealistic, are often assigned high likelihoods by the autoregressive model, resembling an "adversarial example" (Szegedy et al., 2013; Biggio et al., 2013), an input that causes the model to output an incorrect answer with high confidence. Inspired by the recent success of randomized smoothing techniques in adversarial defense (Cohen et al., 2019), we propose to apply randomized smoothing to autoregressive generative modeling.
More specifically, we propose to address a density estimation problem via a two-stage process. Unlike Cohen et al. (2019), who apply smoothing to the model to make it more robust, we apply smoothing to the data distribution. Specifically, we convolve a symmetric and stationary noise distribution with the data distribution to obtain a new "smoother" distribution. In the first stage, we model the smoothed version of the data distribution using an autoregressive model. In the second stage, we reverse the smoothing process (a procedure which can also be understood as "denoising") by either applying a gradient-based denoising approach (Alain & Bengio, 2014) or introducing another conditional autoregressive model to recover the original data distribution from the smoothed one. By choosing an appropriate smoothing distribution, we aim to make each step easier than the original learning problem: smoothing facilitates learning in the first stage by making the input distribution fully supported without sharp transitions in the density function; generating a sample given a noisy one is easier than generating a sample from scratch. We show with extensive experimental results that our approach is able to drastically improve the sample quality of current autoregressive models on several synthetic datasets and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets. We empirically demonstrate that our method can also be applied to density estimation, image inpainting, and image denoising.

Figure 1: Overview of our method. From a data distribution ($x$) we inject noise ($q(\tilde{x}|x)$), which makes the distribution smoother ($\tilde{x}$); then we model the smoothed distribution ($p_\theta(\tilde{x})$) as well as the denoising step ($p_\theta(x|\tilde{x})$), forming a two-step model.

2. BACKGROUND

We consider a density estimation problem. Given $D$-dimensional i.i.d. samples $\{x_1, x_2, \dots, x_N\}$ from a continuous data distribution $p_{\text{data}}(x)$, the goal is to approximate $p_{\text{data}}(x)$ with a model $p_\theta(x)$ parameterized by $\theta$. A commonly used approach for density estimation is maximum likelihood estimation (MLE), where the objective is to maximize $L(\theta) \triangleq \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$.

2.1. AUTOREGRESSIVE MODELS

An autoregressive model (Larochelle & Murray, 2011; Salimans et al., 2017) decomposes a joint distribution $p_\theta(x)$ into the product of univariate conditionals: $p_\theta(x) = \prod_{i=1}^{D} p_\theta(x_i|x_{<i})$, where $x_i$ stands for the $i$-th component of $x$, and $x_{<i}$ refers to the components with indices smaller than $i$. In general, an autoregressive model parameterizes each conditional $p_\theta(x_i|x_{<i})$ using a prespecified density function (e.g. mixture of logistics). This bounds the capacity of the model by limiting the number of modes for each conditional. Although autoregressive models have achieved top likelihoods amongst all types of density-based models, their sample quality is still lacking compared to energy-based models (Du & Mordatch, 2019) and score-based models (Song & Ermon, 2019). We believe this can be caused by the following two reasons.
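As a rough illustration of the factorization above, the following Python sketch evaluates the joint log-density as a sum of conditional log-densities, each a mixture of logistics. It is only a toy: `conditional_params` is a hypothetical stand-in for a learned network (it is not taken from any of the cited models), and the weights, means and scales it returns are arbitrary assumptions.

```python
import math

def logistic_logpdf(x, mu, s):
    """Log-density of a logistic distribution with location mu and scale s."""
    z = abs(x - mu) / s  # the density is symmetric around mu
    return -z - math.log(s) - 2.0 * math.log1p(math.exp(-z))

def mixture_logpdf(x, weights, mus, scales):
    """Log-density of a mixture of logistics (log-sum-exp for stability)."""
    terms = [math.log(w) + logistic_logpdf(x, m, s)
             for w, m, s in zip(weights, mus, scales)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def conditional_params(context):
    """Hypothetical stand-in for a learned network: context -> mixture parameters."""
    shift = sum(context) / len(context) if context else 0.0
    return [0.5, 0.5], [shift - 1.0, shift + 1.0], [0.3, 0.3]

def autoregressive_logprob(x):
    """log p(x) = sum_i log p(x_i | x_<i), evaluated one dimension at a time."""
    total = 0.0
    for i in range(len(x)):
        weights, mus, scales = conditional_params(x[:i])
        total += mixture_logpdf(x[i], weights, mus, scales)
    return total

print(autoregressive_logprob([0.9, 0.1, -0.4]))
```

Because each conditional here has only two mixture components, it can represent at most two modes per dimension, which is the capacity limit described in the paragraph above.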

2.2. MANIFOLD HYPOTHESIS

Several existing methods (Roweis & Saul, 2000; Tenenbaum et al., 2000) rely on the manifold hypothesis, i.e. that real-world high-dimensional data tends to lie on a low-dimensional manifold (Narayanan & Mitter, 2010). If the manifold hypothesis is true, then the density of the data distribution is not well defined in the ambient space; if the manifold hypothesis holds only approximately and the data lies in the vicinity of a manifold, then only points that are very close to the manifold would have high density, while all other points would have close to zero density. Thus we may expect the data density around the manifold to have large first-order derivatives, i.e. the density function has a high Lipschitz constant (if not infinity). To see this, let us consider a 2-d example where the data distribution is a thin ring distribution (almost a unit circle) formed by rotating the 1-d Gaussian distribution $\mathcal{N}(1, 0.01^2)$ around the origin. The density function of the ring has a high Lipschitz constant near the "boundary". Let us focus on a data point travelling along the diagonal as shown in the leftmost panel in figure 2. We plot the first-order directional derivatives of the density for the point as it approaches the boundary from the inside, then lands on the ring, and finally moves outside the ring (see figure 2). As we can see, when the point is far from the boundary, the derivative has a small magnitude. When the point moves closer to the boundary, the magnitude increases and changes significantly near the boundary even with small displacements in the trajectory. However, once the point has landed on the ring, the magnitude starts to decrease. As it gradually moves off the ring, the magnitude first increases and then decreases just like when the point approached the boundary from the inside.
It has been observed that certain likelihood models, such as normalizing flows, exhibit pathological behaviors on data distributions whose densities have high Lipschitz constants (Cornish et al., 2019) . Since each conditional of an autoregressive model is a 1-d normalizing flow given a fixed context, a high Lipschitz constant on data density could also hinder learning of autoregressive models.
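The behavior described for the ring example can be reproduced numerically. The sketch below is a minimal illustration under one assumption that is not spelled out in the text: we model the ring density as a radial $\mathcal{N}(1, 0.01^2)$ spread uniformly over the angle (hence the $1/(2\pi r)$ factor). The directional derivative along the diagonal is estimated with a central difference.

```python
import math

SIGMA = 0.01  # radial standard deviation of the ring, as in the example above

def ring_density(x, y):
    """Approximate ring density: radial N(1, SIGMA^2) spread uniformly over the angle."""
    r = math.hypot(x, y)
    radial = math.exp(-0.5 * ((r - 1.0) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))
    return radial / (2.0 * math.pi * r)

def diagonal_derivative(t, h=1e-6):
    """Central-difference derivative of the density along the diagonal,
    evaluated at distance t from the origin."""
    u = 1.0 / math.sqrt(2.0)  # the unit vector (u, u) points along the diagonal
    ahead = ring_density((t + h) * u, (t + h) * u)
    behind = ring_density((t - h) * u, (t - h) * u)
    return (ahead - behind) / (2.0 * h)

for t in (0.5, 0.9, 0.99, 1.0, 1.01, 1.1):
    print(f"distance {t:4.2f}: derivative {diagonal_derivative(t):+.3e}")
```

Under this assumed density, the derivative is negligible far from the ring, becomes large just inside it, shrinks in magnitude on the ring itself, and becomes large again (with the opposite sign) just outside.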

2.3. COMPOUNDING ERRORS IN AUTOREGRESSIVE MODELING

Autoregressive models can also be susceptible to compounding errors from the conditional distributions (Lamb et al., 2016) at sampling time. An autoregressive model $p_\theta(x)$ learns the joint density $p_{\text{data}}(x)$ by matching each conditional $p_\theta(x_i|x_{<i})$ with $p_{\text{data}}(x_i|x_{<i})$. In practice, we typically have access to a limited amount of training data, which makes it hard for an autoregressive model to capture all the conditional distributions correctly due to the curse of dimensionality. During sampling, since a prediction is made based on the previously generated context, once a mistake is made at a previous step, the model is likely to make more mistakes in the later steps, eventually generating a sample $x$ that is far from being an actual image, but is mistakenly assigned a high likelihood by the model. The generated image $x$, being unrealistic but assigned a high likelihood, resembles an adversarial example, i.e., an input that causes the model to make mistakes. Recent works (Cohen et al., 2019) in adversarial defense have shown that random noise can be used to improve the model's robustness to adversarial perturbations, a process during which adversarial examples that are close to actual data are generated to fool the model. We hypothesize that such an approach can also be applied to improve autoregressive modeling by making the model less vulnerable to compounding errors that occur during density estimation. Inspired by the success of randomized smoothing in adversarial defense (Cohen et al., 2019), we propose to apply smoothing to autoregressive modeling to address the problems mentioned above.

3. GENERATIVE MODELS WITH DISTRIBUTION SMOOTHING

In the following, we propose to decompose a density estimation task into a smoothed data modeling problem followed by an inverse smoothing problem where we recover the true data density from the smoothed one.

3.1. RANDOMIZED SMOOTHING PROCESS

Unlike Cohen et al. (2019), where randomized smoothing is applied to a model, we apply smoothing directly to the data distribution $p_{\text{data}}(x)$. To do this, we introduce a smoothing distribution $q(\tilde{x}|x)$, a distribution that is symmetric and stationary (e.g. a Gaussian or Laplacian kernel), and convolve it with $p_{\text{data}}(x)$ to obtain a new distribution $q(\tilde{x}) \triangleq \int q(\tilde{x}|x)\, p_{\text{data}}(x)\, dx$. When $q(\tilde{x}|x)$ is a normal distribution, this convolution process is equivalent to perturbing the data distribution with Gaussian noise, which, intuitively, will make the data distribution smoother. In the following, we formally prove that convolving a 1-d distribution $p_{\text{data}}(x)$ with a suitable noise can indeed "smooth" $p_{\text{data}}(x)$.

Theorem 1. Given a continuous and bounded 1-d distribution $p_{\text{data}}(x)$ that is supported on $\mathbb{R}$, for any 1-d distribution $q(\tilde{x}|x)$ that is symmetric (i.e. $q(\tilde{x}|x) = q(x|\tilde{x})$), stationary (i.e. translation invariant) and satisfies $\lim_{x \to \infty} p_{\text{data}}(x)\, q(x|\tilde{x}) = 0$ for any given $\tilde{x}$, we have $\text{Lip}(q(\tilde{x})) \le \text{Lip}(p_{\text{data}}(x))$, where $q(\tilde{x}) \triangleq \int q(\tilde{x}|x)\, p_{\text{data}}(x)\, dx$ and $\text{Lip}(\cdot)$ denotes the Lipschitz constant of the given 1-d function.

Theorem 1 shows that convolving a 1-d data distribution $p_{\text{data}}(x)$ with a suitable noise distribution $q(\tilde{x}|x)$ (e.g. $\mathcal{N}(\tilde{x}|x, \sigma^2)$) can reduce the Lipschitzness (i.e. increase the smoothness) of $p_{\text{data}}(x)$. We provide the proof of Theorem 1 in Appendix A. Given $p_{\text{data}}(x)$ with a high Lipschitz constant, we empirically verify that density estimation becomes an easier task on the smoothed distribution $q(\tilde{x})$ than directly on $p_{\text{data}}(x)$. To see this, we visualize a 1-d example in figure 3a, where we want to model a ten-mode data distribution with a mixture of logistics model. If our model has three logistic components, there is almost no way for the model, which only has three modes, to perfectly fit this data distribution, which has ten separate modes with sharp transitions.
The model, after training (see figure 3a), mistakenly assigns a much higher density to the low density regions between nearby modes. If we convolve the data distribution with $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, 0.5^2)$, the new distribution becomes smoother (see figure 3b) and can be captured reasonably well by the same mixture of logistics model with only three modes (see figure 3b). Comparing the same model's performance on the two density estimation tasks, we can see that the model is doing a better job at modeling the smoothed version of the data distribution than the original data distribution, which has a high Lipschitz constant. This smoothing process can also be understood as a regularization term for the original maximum likelihood objective (on the un-smoothed data distribution), encouraging the learned model to be smooth, as formalized by the following statement:

Proposition 1 (Informal). Assume that the symmetric and stationary smoothing distribution $q(\tilde{x}|x)$ has small variance and negligible higher order moments, then

$\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] \approx \mathbb{E}_{p_{\text{data}}(x)} \left[ \log p_\theta(x) + \frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2} \right]$,

for some constant $\eta$.

Proposition 1 shows that our smoothing process provides a regularization effect on the original objective $\mathbb{E}_{p_{\text{data}}(x)} [\log p_\theta(x)]$ when no noise is added, where the regularization aims to maximize $\frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2}$. Since samples from $p_{\text{data}}$ should be close to a local maximum of the model, this encourages the second order gradients computed at a data point $x$ to become closer to zero (if it were positive then $x$ will not be a local maximum), creating a smoothing effect. This extra term is also the trace of the score function (up to a multiplicative constant) that can be found in the score matching objective (Hyvärinen, 2005), which is closely related to many denoising methods (Vincent, 2011; Hyvärinen, 2008). This regularization effect can, intuitively, increase the generalization capability of the model.
In fact, it has been demonstrated empirically that training with noise can lead to improvements in network generalization (Sietsma & Dow, 1991; Bishop, 1995) . Our argument is also similar to that used in (Bishop, 1995) except that we consider a more general generative modeling case as opposed to supervised learning with squared error. We provide the formal statement and proof of Proposition 1 in Appendix A.
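The claim of Theorem 1 can be checked numerically on a simple case. The sketch below is an illustration only, under assumptions of our choosing: it borrows the two-component Gaussian mixture and the Gaussian smoothing scale of 0.3 from Appendix B.1, and it estimates the Lipschitz constant by taking the largest absolute derivative of the density on a finite grid. For Gaussians the convolution is available in closed form, since smoothing simply adds the noise variance to each component variance.

```python
import math

def normal_pdf_grad(x, mu, var):
    """Derivative with respect to x of the N(mu, var) density."""
    pdf = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return -(x - mu) / var * pdf

def mixture_grad(x, noise_var):
    """Derivative of 0.5 N(-0.3, 0.1^2) + 0.5 N(0.3, 0.1^2) after Gaussian smoothing.

    Convolving with N(0, noise_var) adds noise_var to each component variance.
    """
    var = 0.1 ** 2 + noise_var
    return 0.5 * normal_pdf_grad(x, -0.3, var) + 0.5 * normal_pdf_grad(x, 0.3, var)

def lipschitz_estimate(noise_var, lo=-2.0, hi=2.0, steps=4000):
    """Grid estimate of the Lipschitz constant: the largest |derivative| of the density."""
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(mixture_grad(x, noise_var)) for x in grid)

lip_data = lipschitz_estimate(noise_var=0.0)
lip_smoothed = lipschitz_estimate(noise_var=0.3 ** 2)
print(lip_data, lip_smoothed)
```

In this toy setting the estimate for the smoothed density is well below the estimate for the original density, in line with the inequality in Theorem 1.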

3.2. AUTOREGRESSIVE DISTRIBUTION SMOOTHING MODELS

Motivated by the previous 1-d example, instead of directly modeling $p_{\text{data}}(x)$, which can have a high Lipschitz constant, we propose to first train an autoregressive model on the smoothed version of the data distribution $q(\tilde{x})$. Although the smoothing process makes the distribution easier to learn, it also introduces bias. Thus, we need an extra step to debias the learned distribution by reverting the smoothing process. If our goal is to generate approximate samples for $p_{\text{data}}(x)$, when $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I)$ and $\sigma$ is small, we can use the gradient of $p_\theta(\tilde{x})$ for denoising (Alain & Bengio, 2014). More specifically, given smoothed samples $\tilde{x}$ from $p_\theta(\tilde{x})$, we can "denoise" samples via:

$\hat{x} = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p_\theta(\tilde{x})$, (2)

which only requires the knowledge of $p_\theta(\tilde{x})$ and the ability to sample from it. However, this approach does not provide a likelihood estimate and Eq. (2) only works when $q(\tilde{x}|x)$ is Gaussian (though alternative denoising updates for other smoothing processes could be derived under the Empirical Bayes framework (Raphan & Simoncelli, 2011)). Although Eq. (2) could provide reasonable denoising results when the smoothing distribution has a small variance, $\hat{x}$ obtained in this way is only a point estimate, $\hat{x} = \mathbb{E}[x|\tilde{x}]$, and does not capture the uncertainty of $p(x|\tilde{x})$. To invert more general smoothing distributions (beyond Gaussians) and to obtain likelihood estimates, we introduce a second autoregressive model $p_\theta(x|\tilde{x})$. The parameterized joint density $p_\theta(x, \tilde{x})$ can then be computed as $p_\theta(x, \tilde{x}) = p_\theta(x|\tilde{x})\, p_\theta(\tilde{x})$. To obtain our approximation of $p_{\text{data}}(x)$, we need to integrate over $\tilde{x}$ on the joint distribution $p_\theta(x, \tilde{x})$ to obtain $p_\theta(x) = \int p_\theta(x, \tilde{x})\, d\tilde{x}$, which is in general intractable. However, we can easily obtain an evidence lower bound (ELBO):

$\log p_\theta(x) \ge \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] - \mathbb{E}_{q(\tilde{x}|x)} [\log q(\tilde{x}|x)] + \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(x|\tilde{x})]$. (3)
Note that when $q(\tilde{x}|x)$ is fixed, the entropy term $\mathbb{E}_{q(\tilde{x}|x)} [\log q(\tilde{x}|x)]$ is a constant with respect to the optimization parameters. Maximizing the ELBO on $p_{\text{data}}(x)$ is then equivalent to maximizing:

$J(\theta) = \mathbb{E}_{p_{\text{data}}(x)} \left[ \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] + \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(x|\tilde{x})] \right]$. (4)

From Eq. (4), we can see that optimizing the two models $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$ separately via maximum likelihood estimation is equivalent to optimizing $J(\theta)$.
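To make the three-term lower bound concrete, here is a small Monte Carlo sketch on a toy problem. Everything specific in it is an assumption made for illustration: the data distribution is taken to be a 1-d standard normal and the smoothing distribution is Gaussian, so that the smoothed marginal and the denoising posterior are both known in closed form and stand in for the two learned models.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo(x, sigma, num_samples=100, seed=0):
    """Monte Carlo estimate of the lower bound on log p(x).

    Toy assumption: the data are N(0, 1), so both stages are known exactly:
      first stage   p(x_tilde)     = N(0, 1 + sigma^2)
      second stage  p(x | x_tilde) = N(x_tilde / (1 + sigma^2), sigma^2 / (1 + sigma^2))
    """
    rng = random.Random(seed)
    var = sigma ** 2
    total = 0.0
    for _ in range(num_samples):
        x_tilde = x + sigma * rng.gauss(0.0, 1.0)  # draw x_tilde from q(x_tilde | x)
        log_smoothed = normal_logpdf(x_tilde, 0.0, 1.0 + var)
        log_denoise = normal_logpdf(x, x_tilde / (1.0 + var), var / (1.0 + var))
        log_q = normal_logpdf(x_tilde, x, var)
        total += log_smoothed + log_denoise - log_q
    return total / num_samples

x = 0.7
print(elbo(x, sigma=0.5), normal_logpdf(x, 0.0, 1.0))
```

Because both stages are exact in this toy case, the bound is tight and the two printed values agree up to floating-point error. With imperfect models for either stage, the estimate would, in expectation, fall below the true log-likelihood.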

3.3. TRADEOFF IN MODELING

In general, there is a trade-off between the difficulty of modeling $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$. To see this, let us consider two extreme cases for the variance of $q(\tilde{x}|x)$: zero variance and infinite variance. When $q(\tilde{x}|x)$ has zero variance, it is a distribution with all its probability mass at $x$, meaning that no noise is added to the data distribution. In this case, modeling the smoothed distribution would be equivalent to modeling $p_{\text{data}}(x)$, which can be hard as discussed above. The reverse smoothing process, however, would be easy since $p_\theta(x|\tilde{x})$ can simply be an identity map to perfectly invert the smoothing process. In the second case, when $q(\tilde{x}|x)$ has infinite variance, modeling $p(\tilde{x})$ would be easy because all the information about the original data is lost, and $p(\tilde{x})$ would be close to the smoothing distribution. Modeling $p(x|\tilde{x})$, on the other hand, is equivalent to directly modeling $p_{\text{data}}(x)$, which can be challenging. Thus, the key here is to appropriately choose a smoothing level so that both $q(\tilde{x})$ and $p(x|\tilde{x})$ can be approximated relatively well by existing autoregressive models. In general, the optimal variance might be hard to find. Although one can train $q(\tilde{x}|x)$ by jointly optimizing the ELBO, in practice, we find this approach often assigns a very large variance to $q(\tilde{x}|x)$, which can trade off sample quality for better likelihoods on high-dimensional image datasets. We find empirically that a pre-specified $q(\tilde{x}|x)$ chosen by heuristics (Saremi & Hyvarinen, 2019; Garreau et al., 2017) is able to generate much better samples than training $q(\tilde{x}|x)$ via the ELBO. In this paper, we will focus on the sample quality and leave the training of $q(\tilde{x}|x)$ for future work.

4. EXPERIMENTS

In this section, we demonstrate empirically that by appropriately choosing the smoothness level of randomized smoothing, our approach is able to drastically improve the sample quality of existing autoregressive models on several synthetic and real-world datasets while retaining competitive likelihoods on synthetic datasets. We also present results on image inpainting in Appendix C.2.

4.1. CHOOSING THE SMOOTHING DISTRIBUTION

To help us build insights into the selection of the smoothing distribution $q(\tilde{x}|x)$, we first focus on a 1-d multi-modal distribution (see figure 4, leftmost panel). We use model-based methods to invert the smoothed distribution and provide analysis on "single-step denoising" in Appendix B.1. We start with the exploration of three different types of smoothing distributions: Gaussian, Laplace, and uniform. For each type of distribution, we perform a grid search to find the optimal variance. Since our approach requires the modeling of both $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$, we stack $\tilde{x}$ and $x$ together, and use a MADE model (Germain et al., 2015) with a mixture of two logistic components to parameterize $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$ at the same time. For the baseline model, we train a mixture of logistics model directly on $p_{\text{data}}(x)$. We compare the results in the middle two panels in figure 4. We find that although the baseline with eight logistic components has the capacity to perfectly model the multi-modal data distribution, which has six modes, the baseline model still fails to do so. We believe this can be caused by optimization or initialization issues for modeling a distribution with a high Lipschitz constant. Our method, on the other hand, demonstrates more robustness by successfully modeling the different modes in the data distribution even when using only two mixture components for both $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$. For all three types of smoothing distributions, we observe a reverse U-shape correlation between the variance of $q(\tilde{x}|x)$ and ELBO values, with the ELBO first increasing as the variance increases and then decreasing as the variance grows beyond a certain point. The results match our discussion on the trade-off between modeling $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$ in Section 3.3. We notice from the empirical results that Gaussian smoothing is able to obtain a better ELBO than the other two distributions.
Thus, we will use $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I)$ for the later experiments.

4.2. 2-D SYNTHETIC DATASETS

In this section, we consider two challenging 2-d multi-modal synthetic datasets (see figure 5). We focus on model-based denoising methods and present discussion on "single-step denoising" in Appendix B.2. We use a MADE model with a comparable number of total parameters for both the baseline and our approach. For the baseline, we train the MADE model directly on the data. For our randomized smoothing model, we choose $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, 0.3^2 I)$ to be the smoothing distribution. We observe that with this randomized smoothing approach, our model is able to generate better samples than the baseline (according to a human observer) even when using fewer logistic components (see figure 5). We provide more analysis on the model's performance in Appendix B.2. We also provide the negative log-likelihoods in Tab. 1.

4.3. IMAGE EXPERIMENTS

In this section, we focus on three common image datasets, namely MNIST, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). We select $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I)$ to be the smoothing distribution. We use PixelCNN++ (Salimans et al., 2017) as the model architecture for both $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$. We provide more details about settings in Appendix C.

Image generation. For image datasets, we select the $\sigma$ of $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I)$ according to analysis in (Saremi & Hyvarinen, 2019) (see Appendix C for more details). Since $q(\tilde{x}|x)$ is a Gaussian distribution, we can apply "single-step denoising" to reverse the smoothing process for samples drawn from $p_\theta(\tilde{x})$. In this case, the model $p_\theta(x|\tilde{x})$ is not required for sampling since the gradient of $p_\theta(\tilde{x})$ can be used to denoise samples (also from $p_\theta(\tilde{x})$) (see equation 2). We present smoothed samples from $p_\theta(\tilde{x})$, reversed smoothing samples processed by "single-step denoising" and processed by $p_\theta(x|\tilde{x})$ in figure 6. For comparison, we also present samples from a PixelCNN++ with parameters comparable to the sum of total parameters of $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$. We find that by using this randomized smoothing approach, we are able to drastically improve the sample quality of PixelCNN++ (see the rightmost panel in figure 6). We note that with only $p_\theta(\tilde{x})$, a PixelCNN++ optimized on the smoothed data, we already obtain more realistic samples compared to the original PixelCNN++ method. However, $p_\theta(x|\tilde{x})$ is needed to compute the likelihood lower bounds. We report the sample quality evaluated by Fréchet Inception Distance (FID (Heusel et al., 2017)), Kernel Inception Distance (KID (Bińkowski et al., 2018)), and Inception scores (Salimans et al., 2016) in Tab. 2. Although our method obtains better samples compared to the original PixelCNN++, our model has worse likelihoods as measured in bits per dimension (BPD). We believe this is because likelihood and sample quality are not always directly correlated, as discussed in Theis et al. (2015).
We also tried training the variance for $q(\tilde{x}|x)$ by jointly optimizing the ELBO. Although samples from "single-step" might appear visually similar to samples from the "two-step" method, there is still a gap between their Inception, FID and KID scores.

5. ADDITIONAL EXPERIMENTS ON NORMALIZING FLOWS

In this section, we demonstrate empirically on 2-d synthetic datasets that randomized smoothing techniques can also be applied to improve the sample quality of normalizing flow models (Rezende & Mohamed, 2015). We focus on RealNVP (Dinh et al., 2016). We compare the RealNVP model trained with randomized smoothing, where we use $p_\theta(x|\tilde{x})$ (also a RealNVP) to revert the smoothing process, with a RealNVP trained with the original method but with a comparable number of parameters. We observe that smoothing is able to improve sample quality on the datasets we consider (see figure 7) while also obtaining competitive likelihoods. On the checkerboard dataset, our method has a negative log-likelihood of 3.64 while the original RealNVP has 3.72; on the Olympics dataset, our method has a negative log-likelihood of 1.32 while the original RealNVP has 1.80. This example demonstrates that randomized smoothing techniques can also be applied to normalizing flow models.

6. RELATED WORK

Our approach shares some similarities with denoising autoencoders (DAE, Vincent et al. (2008) ) which recovers a clean observation from a corrupted one. However, unlike DAE which has a train- This way, the model can start from a distribution that is easy to model and gradually move to the desired distribution. However, due to the large number of noise levels, such approaches require many steps for the chain to converge to the right data distribution. In this paper, we instead propose to use only one level of smoothing by modeling each step with a powerful autoregressive model instead of deterministic autoencoders. Motivated by the success of "randomized smoothing" techniques in adversarial defense (Cohen et al., 2019) , we perform randomized smoothing directly on the data distribution. Unlike denoising score matching (Vincent, 2011) , a technique closely related to denoising diffusion models and NCSN, which requires the perturbed noise to be a Gaussian distribution, we are able to work with different noise distributions. Our smoothing method is also relevant to "dequantization" approaches that are common in normalizing flow models, where the discrete data distribution is converted to a continuous one by adding continuous noise (Uria et al., 2013; Ho et al., 2019) . However the added noise for "dequantization" in flows is often indistinguishable to human eyes, and the reverse "dequantization" process is often ignored. In contrast, we consider noise scales that are significantly larger and thus a denoising process is required. Our method is also related to "quantization" approaches which reduce the number of "significant" bits that are modeled by a generative model (Kingma & Dhariwal, 2018; Menick & Kalchbrenner, 2018) . For instance, Glow (Kingma & Dhariwal, 2018) only models the 5 most significant bits of an image, which improves the visual quality of samples but decreases color fidelity. 
SPN (Menick & Kalchbrenner, 2018) introduces another network to predict the remaining bits conditioned on the 3 most significant bits already modeled. Modeling the most significant bits can be understood as capturing a data distribution perturbed by bit-wise correlated noise, similar to modeling smoothed data in our method. Modeling the remaining bits conditioned on the most significant ones in SPN is then similar to denoising. However, unlike these quantization approaches, which process an image at the level of "significant" bits, we apply continuous, data-independent Gaussian noise to the entire image with a different motivation: to smooth the data density function.

7. DISCUSSION

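In this paper, we propose to incorporate randomized smoothing techniques into autoregressive modeling. By choosing the smoothness level appropriately, this seemingly simple approach is able to drastically improve the sample quality of existing autoregressive models on several synthetic and real-world datasets while retaining reasonable likelihoods. Our work provides insights into how recent adversarial defense techniques can be leveraged to build more robust generative models. Since we apply randomized smoothing directly to the target data distribution rather than to the model, we believe our approach is also applicable to other generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs).

A PROOFS

Theorem 1. Given a continuous and bounded 1-d distribution $p_{\text{data}}(x)$ that is supported on $\mathbb{R}$, for any 1-d distribution $q(\tilde{x}|x)$ that is symmetric (i.e. $q(\tilde{x}|x) = q(x|\tilde{x})$), stationary (i.e. translation invariant) and satisfies $\lim_{x \to \infty} p_{\text{data}}(x)\, q(x|\tilde{x}) = 0$ for any given $\tilde{x}$, we have $\text{Lip}(q(\tilde{x})) \le \text{Lip}(p_{\text{data}}(x))$, where $q(\tilde{x}) \triangleq \int q(\tilde{x}|x)\, p_{\text{data}}(x)\, dx$ and $\text{Lip}(\cdot)$ denotes the Lipschitz constant of the given 1-d function.

Proof. First, we have that:

$|\nabla_x p(x)| = |p(x) \nabla_x \log p(x)| \le \text{Lip}(p)$

and if we assume symmetry, i.e. $q(x|\tilde{x}) = q(\tilde{x}|x)$, then by integration by parts we have:

$\nabla_{\tilde{x}} q(\tilde{x}) = \mathbb{E}_{p(x)} [\nabla_{\tilde{x}} q(\tilde{x}|x)] = \mathbb{E}_{p(x)} [\nabla_x q(x|\tilde{x})] = -\mathbb{E}_{p(x)} [q(x|\tilde{x}) \nabla_x \log p(x)]$

Therefore,

$\text{Lip}(q) = \max_{\tilde{x}} \left| -\mathbb{E}_{p(x)} [q(x|\tilde{x}) \nabla_x \log p(x)] \right| = \max_{\tilde{x}} \left| \int_x q(x|\tilde{x})\, p(x) \nabla_x \log p(x)\, dx \right|$ (7)
$\le \max_{\tilde{x}} \left| \int_x q(x|\tilde{x})\, \text{Lip}(p)\, dx \right| = \text{Lip}(p) \max_{\tilde{x}} \left| \int_x q(x|\tilde{x})\, dx \right| = \text{Lip}(p)$

which proves the result.

Proposition 1 (Formal). Given a $D$-dimensional data distribution $p_{\text{data}}(x)$ and model distribution $p_\theta(x)$, assume that the model and the smoothing distribution $q(\tilde{x}|x)$ satisfy:
• $\log p_\theta$ is infinitely differentiable on the support of $p_\theta(x)$
• $q(\tilde{x}|x)$ is symmetric (i.e. $q(\tilde{x}|x) = q(x|\tilde{x})$)
• $q(\tilde{x}|x)$ is stationary (i.e. translation invariant)
• $q(\tilde{x}|x)$ is bounded and fully supported on $\mathbb{R}^D$
• $q(\tilde{x}|x)$ is element-wise independent
• $\mathbb{E}_{q(\tilde{x}|x)} [(\tilde{x} - x)^2]$ is bounded, and $\mathbb{E}_{q(\tilde{x}|x)} [(\tilde{x}_i - x_i)^2] = \eta$ at each dimension $i$.

Denote $\epsilon = \tilde{x} - x$, then

$\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] = \mathbb{E}_{p_{\text{data}}(x)} \left[ \log p_\theta(x) + \frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2} \right] + \iint o(\epsilon^2)\, p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon$,

where $o(\epsilon^2): \mathbb{R}^D \to \mathbb{R}$ is a function of $\epsilon$ such that $\lim_{\epsilon \to 0} \frac{o(\epsilon^2)}{\epsilon^2} = 0$. Thus when $\iint o(\epsilon^2)\, p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon \to 0$, we have

$\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] \to \mathbb{E}_{p_{\text{data}}(x)} \left[ \log p_\theta(x) + \frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2} \right]$.

Proof. To see this, we first note that the new training objective for the smoothed data distribution is $\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})]$. Let $\epsilon = \tilde{x} - x$. Because of the assumptions above, the density $q(\tilde{x}|x)$ can be reparameterized as $p(\epsilon)$, which satisfies: $p$ is bounded and fully supported on $\mathbb{R}^D$; $p$ is element-wise independent and $\mathbb{E}_{p(\epsilon)} [\epsilon_i^2] = \mathbb{E}_{p(\epsilon_i)} [\epsilon_i^2] = \eta$ at each dimension $i$ ($i = 1, \dots, D$). Then we have

$\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] = \iint \log p_\theta(x + \epsilon)\, p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon$. (9)

Using Taylor expansion, we have:

$\log p_\theta(x + \epsilon) = \log p_\theta(x) + \sum_i \epsilon_i \frac{\partial \log p_\theta}{\partial x_i} + \frac{1}{2} \sum_{i,j} \epsilon_i \epsilon_j \frac{\partial^2 \log p_\theta}{\partial x_i \partial x_j} + o(\epsilon^2)$.

Since $\epsilon$ is independent of $x$ and $\int \epsilon_i\, p(\epsilon)\, d\epsilon = 0$, $\int \epsilon_i \epsilon_j\, p(\epsilon)\, d\epsilon = \delta_{i,j}\, \eta$, where $\delta_{i,j}$ is the Kronecker delta function, the right hand side of Equation 9 becomes

$\mathbb{E}_{p_{\text{data}}(x)} [\log p_\theta(x)] + \iint \left( \frac{1}{2} \sum_i \epsilon_i^2 \frac{\partial^2 \log p_\theta}{\partial x_i^2} + o(\epsilon^2) \right) p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon$ (10)
$= \mathbb{E}_{p_{\text{data}}(x)} \left[ \log p_\theta(x) + \frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2} \right] + \iint o(\epsilon^2)\, p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon$. (11)

When $\iint o(\epsilon^2)\, p_{\text{data}}(x)\, p(\epsilon)\, dx\, d\epsilon \to 0$, we have

$\mathbb{E}_{p_{\text{data}}(x)} \mathbb{E}_{q(\tilde{x}|x)} [\log p_\theta(\tilde{x})] \to \mathbb{E}_{p_{\text{data}}(x)} \left[ \log p_\theta(x) + \frac{\eta}{2} \sum_i \frac{\partial^2 \log p_\theta}{\partial x_i^2} \right]$, (12)

where the second term on the right hand side serves as a regularization for the original objective $\mathbb{E}_{p_{\text{data}}(x)} [\log p_\theta(x)]$.

B DENOISING EXPERIMENTS

B.1 ANALYSIS ON 1-D DENOISING

To provide more insights into denoising, we first study "single-step denoising" (see equation 2) on a 1-d dataset.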
We choose the data distribution to be a mixture of two Gaussians, $0.5\,\mathcal{N}(-0.3, 0.1^2) + 0.5\,\mathcal{N}(0.3, 0.1^2)$, and the smoothing distribution to be $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, 0.3^2)$ (see figure 8a). Since the convolution of two Gaussian distributions is also a Gaussian distribution, the smoothed data follows a mixture of Gaussians given by $0.5\,\mathcal{N}(-0.3, 0.1^2 + 0.3^2) + 0.5\,\mathcal{N}(0.3, 0.1^2 + 0.3^2)$. The ground truth of $\nabla_{\tilde{x}} \log p(\tilde{x})$ can then be calculated in closed form. Thus, given the smoothed data $\tilde{x}$, we can calculate the ground truth $\nabla_{\tilde{x}} \log p(\tilde{x})$ in equation 2 and obtain $\hat{x}$ using "single-step denoising". We visualize the denoising results in figure 8b. We find that the low density region between the two modes in $p_{\text{data}}(x)$ is not modeled properly in figure 8b. However, this is expected, since "single-step denoising" uses $\hat{x} = \mathbb{E}[x|\tilde{x}]$ as the substitute for the denoised result. When the smoothing distribution has a large variance (like in figure 8a, where the smoothed data has merged into a one-mode distribution), datapoints like $\tilde{x}_0$ in the middle low density region of $p_{\text{data}}(x)$ can have high density in the smoothed distribution. Since $\tilde{x}_0$, as well as other points in the middle low density region of $p_{\text{data}}(x)$, can come from both modes of $p_{\text{data}}(x)$ with high probability before the smoothing process (see figure 11a), the denoised $\hat{x} = \mathbb{E}[x|\tilde{x} = \tilde{x}_0]$ can still be located in the middle low density region (see figure 8b). Since a large proportion of the smoothed data is located in the middle low density region of $p_{\text{data}}(x)$, we would expect a certain proportion of the density to remain in the low density region after "single-step denoising", just as shown in figure 8b. However, when the smoothing distribution has a smaller variance, "single-step denoising" can achieve much better denoising results (see figure 9, where we use $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, 0.1^2)$).
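The closed-form computation described above can be sketched in a few lines of Python. This is an illustrative sketch using the same mixture and noise scale as the text; the helper names are our own. It applies the update from equation 2 with the exact score of the smoothed mixture and compares the result with the posterior mean $\mathbb{E}[x|\tilde{x}]$ computed directly.

```python
import math

MEANS = (-0.3, 0.3)      # component means of the data distribution
DATA_VAR = 0.1 ** 2      # variance of each data component
NOISE_VAR = 0.3 ** 2     # variance of the Gaussian smoothing distribution

def responsibilities(x_tilde):
    """Posterior probability that x_tilde came from each (equally weighted) component."""
    var = DATA_VAR + NOISE_VAR
    logits = [-0.5 * (x_tilde - m) ** 2 / var for m in MEANS]
    top = max(logits)
    unnorm = [math.exp(v - top) for v in logits]
    return [u / sum(unnorm) for u in unnorm]

def smoothed_score(x_tilde):
    """Closed-form gradient of the log smoothed density at x_tilde."""
    var = DATA_VAR + NOISE_VAR
    return sum(w * (-(x_tilde - m) / var)
               for w, m in zip(responsibilities(x_tilde), MEANS))

def single_step_denoise(x_tilde):
    """Equation 2: x_hat = x_tilde + sigma^2 * score of the smoothed density."""
    return x_tilde + NOISE_VAR * smoothed_score(x_tilde)

def posterior_mean(x_tilde):
    """E[x | x_tilde], computed directly from the Gaussian mixture posterior."""
    shrink = DATA_VAR / (DATA_VAR + NOISE_VAR)
    return sum(w * (m + shrink * (x_tilde - m))
               for w, m in zip(responsibilities(x_tilde), MEANS))

for x_tilde in (-0.6, -0.1, 0.0, 0.1, 0.6):
    print(x_tilde, single_step_denoise(x_tilde), posterior_mean(x_tilde))
```

The two columns coincide, and a smoothed point at the midpoint between the modes is mapped back to the midpoint, which is the low density region of the data distribution. This is the failure mode of the point estimate discussed above.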
Although denoising can be easier when the smoothing distribution has a smaller variance, modeling the smoothed distribution could be harder, as we discussed before. In general, the right denoising results should be samples coming from $p(x|\tilde{x})$, which is the reason why sampling from $p_\theta(x|\tilde{x})$ (i.e. introducing the model $p_\theta(x|\tilde{x})$) is preferable to using $\mathbb{E}_\theta[x|\tilde{x}]$ as a denoising substitute (i.e. "single-step denoising"). The capacity of the denoising model $p_\theta(x|\tilde{x})$ also matters for the denoising results. Let us again consider the datapoint $\tilde{x}_0$ shown in figure 8a. If the inverse smoothing model $p_\theta(x|\tilde{x})$ is a unimodal logistic distribution, then, due to the mode covering property of maximum likelihood estimation, given the smoothed observation $\tilde{x}_0$, the best the model can do is to center its only mode at $\tilde{x}_0$ to approximate $p(x|\tilde{x} = \tilde{x}_0)$ (see figure 10c). Thus, like $\tilde{x}_0$, the smoothed datapoints in the low density region between the two modes of $p_{data}(x)$ are still likely to remain between the two modes after denoising (see figure 10b). To solve this issue, we can increase the capacity of $p_\theta(x|\tilde{x})$ by making it a mixture of two logistics. In this case, the distribution $p_\theta(x|\tilde{x} = \tilde{x}_0)$ can be captured more accurately (compare figure 11c with figure 11a). After inverting the smoothing process, most smoothed datapoints in the low density region, like $\tilde{x}_0$, are mapped to one of the two high density modes (see figure 11b), resulting in much better denoising.

On the 2-d Olympics dataset in section 4.2, we find that the intersections between rings can be poorly modeled with the proposed smoothing approach when only a mixture of two logistics is used (see figure 12e). We believe this happens when the denoising model is not flexible enough to capture the distribution $p(x|\tilde{x})$. More specifically, we note that the ground truth distribution $p(x|\tilde{x})$ at the intersections of the rings is highly complicated and can be hard to capture using our model, which only has a mixture of two logistics for each dimension. If we increase the flexibility of $p_\theta(x|\tilde{x})$ by using three or four logistic mixture components (note that we still use fewer mixture components than the MADE baseline, with a comparable number of parameters), the intersections of the rings are modeled better (see figure 12).
We also provide "single-step denoising" results for the experiments in section 4.2 (see figure 13), where we use the same smoothing distribution and the same MADE model with three mixture components as in section 4.2. We note that the "single-step denoising" results are not very good, which is also expected. As discussed in section B.1, when the smoothing distribution has a relatively large variance, $\mathbb{E}_\theta[x|\tilde{x}]$ is not a good approximation of the denoised result; we want the denoised sample to come from the distribution $p_\theta(x|\tilde{x})$, in which case introducing a denoising model $p_\theta(x|\tilde{x})$ could be a better option. Although we could select $q(\tilde{x}|x)$ to have a smaller variance so that "single-step denoising" works reasonably well, modeling $p(\tilde{x})$ in this case could be more challenging.

For the image experiments, we first rescale images to $[-1, 1]$ and then perturb the images with $q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I)$. We use $\sigma = 0.5$ for MNIST and $\sigma = 0.3$ for both CIFAR-10 and CelebA. The selection of $\sigma$ is mainly based on the analysis in (Saremi & Hyvarinen, 2019). More specifically, we take the median Euclidean distance between pairs of data points in a dataset and divide it by $2\sqrt{D}$, where $D$ is the dimension of the data. This gives a heuristic for selecting the noise level of $q(\tilde{x}|x)$ when $q(\tilde{x}|x)$ is a Gaussian distribution. We find that this choice generates reasonably good samples in practice. We train all the models with the Adam optimizer with learning rate 0.0002. To model $p_\theta(x|\tilde{x})$, we stack $\tilde{x}$ and $x$ together along the second dimension to obtain $\hat{x} = [\tilde{x}, x]$, which ensures that $\tilde{x}$ comes before $x$ in the pixel ordering. For instance, this stacking yields an image $\hat{x}$ of size $1 \times (2 \times 28) \times 28$ for an MNIST image, and an image of size $3 \times (2 \times 32) \times 32$ for a CIFAR-10 image. Since PixelCNN++ consists of convolutional layers, we can directly feed $\hat{x}$ into the default architecture without modifying it. As later pixels of the input depend only on earlier pixels in an autoregressive model and $\tilde{x}$ comes before $x$, we can parameterize $p_\theta(x|\tilde{x})$ by computing the likelihoods only on $x$ using the outputs of the autoregressive model.
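The data preparation described above (choosing $\sigma$, perturbing, and stacking) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: `heuristic_sigma` and `smooth_and_stack` are hypothetical names, estimating the median distance on randomly subsampled pairs is an assumption made to keep the computation cheap, and the random array only stands in for a batch of images rescaled to $[-1, 1]$.

```python
import numpy as np


def heuristic_sigma(data, num_pairs=10_000, rng=None):
    """Median pairwise Euclidean distance divided by 2 * sqrt(D).

    `data` has shape (N, ...). Distances are estimated on random pairs;
    the subsampling is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = data.reshape(len(data), -1).astype(np.float64)
    i = rng.integers(0, len(flat), size=num_pairs)
    j = rng.integers(0, len(flat), size=num_pairs)
    keep = i != j                                   # drop pairs of a point with itself
    dists = np.linalg.norm(flat[i[keep]] - flat[j[keep]], axis=1)
    return float(np.median(dists) / (2.0 * np.sqrt(flat.shape[1])))


def smooth_and_stack(x, sigma, rng):
    """Perturb x with N(0, sigma^2 I) and stack [x_tilde, x] along the height axis.

    For inputs of shape (N, C, H, W) the result has shape (N, C, 2H, W), so the
    rows of x_tilde precede the rows of x in raster-scan order.
    """
    x_tilde = x + sigma * rng.standard_normal(x.shape)
    return np.concatenate([x_tilde, x], axis=-2)


rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(64, 1, 28, 28))   # stand-in for rescaled MNIST
sigma = heuristic_sigma(images, rng=rng)
x_hat = smooth_and_stack(images, sigma, rng)
print(sigma, x_hat.shape)
```

With this layout, a training loss for $p_\theta(x|\tilde{x})$ would sum the per-pixel log-likelihoods over the bottom half of `x_hat` only, since the top half holds the conditioning input $\tilde{x}$.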

C.2 IMAGE INPAINTING

Since both $p_\theta(\tilde{x})$ and $p_\theta(x|\tilde{x})$ are parameterized by an autoregressive model, we can also perform image inpainting using our method. We present the inpainting results on CIFAR-10 in figure 14a and CelebA in figure 14b, where the bottom half of the input image is being inpainted.

C.6 ABLATION STUDIES

In this section, we show that gradient-based "single-step denoising" does not improve sample quality without randomized smoothing. To see this, we draw samples from a PixelCNN++ $p_\theta(x)$ trained directly on $p_{data}(x)$ (i.e. without smoothing). We then perform the "single-step denoising" update defined as $\bar{x} = x + \sigma^2 \nabla_x \log p_\theta(x)$. We explore various values for $\sigma$ and report the results in figure 26. This shows that "single-step denoising" alone (without randomized smoothing) does not improve the sample quality of PixelCNN++.



Figure 2: Manifold hypothesis illustration. The data point travels along the diagonal as shown in the leftmost panel. The white arrow stands for the direction and magnitude of the derivative of the density at the data point. The data location for each figure is $(\sqrt{0.5} + c, \sqrt{0.5} + c)$, where $c$ is the number below each figure and $(\sqrt{0.5}, \sqrt{0.5})$ is the upper right intersection of the trajectory with the unit circle.

Figure 3: Visualization of a 1-d data distribution without smoothing (a) or with smoothing (b), modeled by the same mixture of logistics model.

Figure 4: Density estimation on 1-d synthetic dataset. In the second figure, the digit in the parenthesis denotes the number of mixture components used in the baseline mixture of logistics model. In comparison, our model in the third figure uses only 2 mixture of logistics components for each univariate conditional distribution.

Figure 5: Samples on 2-d synthetic datasets. We use a MADE model with comparable number of parameters for both our method and the baseline. Our model uses 3 mixture of logistics, while the baseline uses 6 (more) mixture of logistics.

Figure 7: RealNVP samples on 2-d synthetic datasets. The RealNVP model trained with randomized smoothing is able to generate better samples according to human observers.

(a) Data distribution ($q(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, 0.3^2)$). (b) Single-step denoising results.

Figure 8: 1-d single-step (gradient based) denoising.

Figure 9: 1-d single-step (gradient based) denoising.

Figure 10: Denoising with $p_\theta(x|\tilde{x})$, which is modeled by a single logistic component.

Figure 12: Samples on 2-d synthetic datasets. We use a MADE model with a comparable number of parameters for both our method and the baseline. The models use a mixture of n logistics for each dimension. Our method is able to obtain reasonable samples when using fewer mixture components, while the baseline still has trouble modeling the two sides of the rings when n = 7.

(a) Ground truth $p(x|\tilde{x} = \tilde{x}_0)$. (b) Denoising with $p_\theta(x|\tilde{x})$. (c) Distribution of $p_\theta(x|\tilde{x} = \tilde{x}_0)$.

Figure 11: Denoising with $p_\theta(x|\tilde{x})$, which is modeled by a mixture of two logistics.

Figure 13: Single-step denoising results on 2-d synthetic datasets. We use the same MADE model with three mixture components and the same smoothing distribution as mentioned in Section 4.2.

Figure 17: CIFAR-10 samples from $p_\theta(x|\tilde{x})$ (unconditioned on class labels).

Figure 18: CIFAR-10 samples from the original PixelCNN++ method (unconditioned on class labels).

Figure 19: CelebA samples from $p_\theta(\tilde{x})$ (unconditioned on class labels).

Figure 20: CelebA samples from $p_\theta(x|\tilde{x})$ (unconditioned on class labels).

Figure 21: CelebA samples from the original PixelCNN++ method (unconditioned on class labels).

Figure 23: Nearest neighbors measured by the $\ell_2$ distance in the feature space of an Inception V3 network pretrained on ImageNet. Images on the left of the red vertical line are samples from our model. Images on the right are nearest neighbors in the training dataset.

Figure 24: Nearest neighbors measured by the $\ell_2$ distance between images. Images on the left of the red vertical line are samples from our model. Images on the right are nearest neighbors in the training dataset.

Figure 26: "Single-step denoising" on PixelCNN++ trained on un-smoothed data. σ = 0 corresponds to the original samples.

Negative log-likelihoods on 2-d synthetic datasets (lower is better). We compare with MADE (Germain et al., 2015), RealNVP (Dinh et al., 2016), and CIF-RealNVP (Cornish et al., 2019).

Although training the variance can produce better likelihoods, it does not generate samples comparable in quality to those of our method (i.e. choosing the variance by heuristics). Thus, it is hard to determine conclusively what the best way of choosing $q(\tilde{x}|x)$ is. We provide more image samples in Appendix C.4 and nearest neighbors analysis in Appendix C.5.

general, the right denoising results should be samples coming from p(x|x), which is the reason why samples from p θ (x|x) (i.e. introducing the model p θ (x|x)) is more ideal than using E θ [x|x] as a denoising substitute (i.e. "single-step denoising"). In general, the capacity of the denoising model p θ (x|x) also matters in terms of denoising results. Let us again consider the datapoint x0 shown in figure8a. If the invert smoothing model p θ (x|x) is a one mode logistic distribution, due to the mode covering property of maximum likelihood estimation, given the smoothed observation x0 , the best the model can do is to center its only mode at x0 for approximating p(x|x = x0 ) (see figure

ACKNOWLEDGEMENTS

The authors would like to thank Kristy Choi for reviewing the draft of the paper. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), ARO, and Amazon AWS.

C.3 IMAGE DENOISING

We note that the reverse smoothing process can also be understood as a denoising process. Besides the "single-step denoising" approach shown above, we can also apply $p_\theta(x|\tilde{x})$ to denoise images. To visualize the denoising performance, we sample $x_{test}$ from the test set and perturb $x_{test}$ with $q(\tilde{x}|x)$ to obtain a noisy sample $\tilde{x}_{test}$. We feed $\tilde{x}_{test}$ into $p_\theta(x|\tilde{x} = \tilde{x}_{test})$ and draw samples from the model. We visualize the results in figure 15. As we can see, the model exhibits reasonable denoising results, which shows that the autoregressive model is capable of learning the data distribution when conditioned on the smoothed data.
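As a 1-d analogue of this procedure, the sketch below perturbs clean samples from the two-Gaussian toy distribution of section B.1 and then denoises them by sampling from the conditional $p(x|\tilde{x})$. It is only an illustration under stated assumptions: the exact closed-form posterior of the toy problem is substituted for a learned model $p_\theta(x|\tilde{x})$, and the function and variable names are hypothetical.

```python
import numpy as np

# Toy setup from B.1: equal-weight mixture of N(-0.3, 0.1^2) and N(0.3, 0.1^2),
# smoothed with q(x_tilde | x) = N(x_tilde | x, 0.3^2).
MEANS = np.array([-0.3, 0.3])
DATA_STD, SIGMA = 0.1, 0.3


def sample_posterior(x_tilde, rng):
    """Draw one x ~ p(x | x_tilde) per noisy point (exact for this toy case)."""
    total_var = DATA_STD**2 + SIGMA**2
    diff = x_tilde[:, None] - MEANS[None, :]
    logits = -0.5 * diff**2 / total_var              # equal weights and variances cancel
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)          # posterior probability of each mode
    comp = (rng.random(len(x_tilde)) < resp[:, 1]).astype(int)   # pick a mode
    gain = DATA_STD**2 / total_var
    post_mean = MEANS[comp] + gain * (x_tilde - MEANS[comp])     # Gaussian conditioning
    post_std = np.sqrt(DATA_STD**2 * SIGMA**2 / total_var)
    return post_mean + post_std * rng.standard_normal(len(x_tilde))


rng = np.random.default_rng(0)
comp = rng.integers(0, 2, size=100_000)
x_test = MEANS[comp] + DATA_STD * rng.standard_normal(comp.size)   # clean samples
x_noisy = x_test + SIGMA * rng.standard_normal(comp.size)          # perturbed samples
x_denoised = sample_posterior(x_noisy, rng)
print("std of clean data:   ", x_test.std())
print("std of denoised data:", x_denoised.std())
```

Because each denoised point is a sample from the conditional rather than its mean, the denoised points should follow the original bimodal distribution in aggregate, in contrast to "single-step denoising", which returns $\mathbb{E}[x|\tilde{x}]$.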

