AUTOREGRESSIVE GENERATIVE MODELING WITH NOISE CONDITIONAL MAXIMUM LIKELIHOOD ESTIMATION

Abstract

We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional model likelihood, we maximize a family of noise conditional likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical covariate shift problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Fréchet inception distance (FID), from 37.50 to 12.09 on the CIFAR-10 dataset.

1. INTRODUCTION

Maximum likelihood estimation (MLE) is arguably the gold standard for probabilistic model fitting, and serves as the de facto method for parameter estimation in countless statistical problems across a variety of fields (Bishop, 2006). Estimators obtained via MLE enjoy a number of theoretical guarantees, including consistency, efficiency, asymptotic normality, and equivariance to model reparameterizations (Van der Vaart, 2000). In the field of density estimation and generative modeling, MLE-based models have played a key role: autoregressive models and normalizing flows boast competitive performance in a bevy of domains, including images (Child et al., 2019), text (Vaswani et al., 2017), audio (Oord et al., 2016), and tabular data (Papamakarios et al., 2017). However, while log-likelihood is broadly agreed upon as one of the most rigorous metrics for goodness-of-fit in statistical and generative modeling, models with high likelihoods do not necessarily produce samples of high visual quality; this phenomenon has been discussed at length in prior work.

In this paper, we offer a framework that addresses this mismatch. We analyze the likelihood-sample quality mismatch in autoregressive models and propose techniques inspired by diffusion models to alleviate it. In particular, we leverage the fact that the score function is naturally learned as a byproduct of maximum likelihood estimation. This enables a novel two-part sampling strategy combining noisy sampling with score-based refinement. Our contributions are threefold. 1) We investigate the pitfalls of training and inference under the maximum likelihood estimation framework, particularly regarding sensitivity to noise corruptions. 2) We propose a simple sanity test for checking the robustness of likelihood models to minor perturbations, and find that many models fail this test. 3) We introduce a novel framework for the training and sampling of MLE models that significantly improves noise-robustness and generated sample quality.
As a result, we obtain a model that can generate samples at a quality approaching that of diffusion models, without losing the maximum likelihood framework and O(1) likelihood evaluation speed of MLE models.
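The score-based refinement step alluded to above rests on a simple fact: any model that exposes log p_θ(x) also exposes the score ∇_x log p_θ(x) via differentiation, and gradient steps along the score pull corrupted samples back toward high-density regions. The following toy sketch illustrates this with a closed-form standard-normal log-density standing in for a learned model (the density, step size, and step count are illustrative choices, not the paper's actual sampler, which also involves a noisy sampling phase):

```python
import numpy as np

def log_p(x):
    # stand-in for a fitted model log-density; here a standard normal
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def score(x, h=1e-5):
    # score function d/dx log p(x), computed by central differences;
    # in a neural likelihood model the same quantity comes from autodiff
    return (log_p(x + h) - log_p(x - h)) / (2 * h)

# gradient-ascent refinement: perturbed points drift toward the mode
x = np.array([3.0, -2.5, 0.7])
for _ in range(200):
    x = x + 0.05 * score(x)
```

After refinement, all three points sit near the density's mode at 0, illustrating how score steps can undo perturbations that a purely autoregressive sampler could not correct.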

2. BACKGROUND AND RELATED WORK

Let our dataset X consist of i.i.d. samples drawn from an unknown target density x ∼ p_data(x). The goal of likelihood-based generative modeling is to approximate p_data via a parametric model p_θ, where samples x ∼ p_θ can be easily obtained.
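As a toy illustration of the noise-conditional objective from the abstract, the sketch below fits a one-dimensional Gaussian model p(x | σ) = N(μ, τ² + σ²) by minimizing the average negative log-likelihood of data perturbed at several noise levels (a grid of σ values approximating the continuum). The model family, noise grid, and grid search are all illustrative simplifications, not the paper's autoregressive image model:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # samples from p_data

sigmas = [0.0, 0.25, 0.5, 1.0]  # noise levels (grid stand-in for a continuum)
noisy_sets = [data + s * rng.normal(size=data.size) for s in sigmas]

def ncmle_nll(mu, tau):
    """Average negative log-likelihood of the perturbed data under the
    noise-conditional model p(x | sigma) = N(mu, tau^2 + sigma^2)."""
    nll = 0.0
    for s, noisy in zip(sigmas, noisy_sets):
        var = tau**2 + s**2
        nll += np.mean(0.5 * np.log(2 * np.pi * var)
                       + (noisy - mu) ** 2 / (2 * var))
    return nll / len(sigmas)

# grid search over mu: because the added noise is zero-mean, the optimum
# of the noise-conditional objective still sits near the true mean (2.0)
mus = np.linspace(0.0, 4.0, 401)
best_mu = mus[int(np.argmin([ncmle_nll(m, 1.0) for m in mus]))]
```

The point of the example is that perturbing the data with zero-mean noise and conditioning the model on the noise level leaves the target of estimation intact while forcing the model to assign reasonable likelihood to corrupted inputs.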



Though each conditional score is trained via score matching, the final model depends heavily on the solution of a chosen SDE, which is not specified by the score-matching framework. Thus score matching alone does not directly produce a diffusion model.



This mismatch has been discussed at length by Theis et al. (2015) and Huszár (2015), and corroborated in empirical studies by Grover et al. (2018) and Kim et al. (2022). Autoregressive models suffer an additional affliction: they have notoriously unstable dynamics during sample generation (Bengio et al., 2015; Lamb et al., 2016) due to their sequential sampling algorithm, which can cause errors to compound across time steps. Such errors usually cannot be corrected ex post facto because of the autoregressive structure of the model, and can substantially affect downstream steps; indeed, we find that model likelihoods are highly sensitive to even the most minor of perturbations.

Score-based diffusion models (Song et al., 2020b; Ho et al., 2020) offer a different perspective on the data generation process. Even though their sampling is also sequential, diffusion models are more robust to perturbations because, in essence, they are trained as denoising functions (Ho et al., 2020). Moreover, the update direction in each step is unconstrained (unlike token-wise autoregressive models, which can update only one token at a time, and only once), so the model can correct errors from previous steps. However, diffusion-model likelihood evaluations have no closed form: they require ODE/SDE solvers and hundreds to thousands of calls to the underlying network, rendering the models incapable of being trained via MLE. Diffusion models also do not inherit any of the asymptotic guarantees of score matching (Hyvärinen & Dayan, 2005; Song et al., 2020a), even though they are trained with score-matching objectives.

Figure 1: Generated samples on CelebA 64x64 (above) and CIFAR-10 (below). Autoregressive models trained via vanilla maximum likelihood (left) are brittle to sampling errors and can quickly diverge, producing nonsensical results. Those trained by our algorithm (right) are more robust and ultimately generate more coherent sequences.

