AUTOREGRESSIVE GENERATIVE MODELING WITH NOISE CONDITIONAL MAXIMUM LIKELIHOOD ESTIMATION

Abstract

We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional model likelihood, we maximize a family of noise conditional likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical covariate shift problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Fréchet Inception distance (FID): from 37.50 to 12.09 on the CIFAR-10 dataset.

1. INTRODUCTION

Maximum likelihood estimation (MLE) is arguably the gold standard for probabilistic model fitting, and serves as the de facto method for parameter estimation in countless statistical problems (Bishop, 2006) across a variety of fields. Estimators obtained via MLE enjoy a number of theoretical guarantees, including consistency, efficiency, asymptotic normality, and equivariance to model reparameterizations (Van der Vaart, 2000). In the field of density estimation and generative modeling, MLE models have played a key role, where autoregressive models and normalizing flows have boasted competitive performance in a bevy of domains, including images (Child et al., 2019), text (Vaswani et al., 2017), audio (Oord et al., 2016), and tabular data (Papamakarios et al., 2017). However, while log-likelihood is broadly agreed upon as one of the most rigorous metrics for goodness-of-fit in statistical and generative modeling, models with high likelihoods do not necessarily produce samples of high visual quality. This phenomenon has been discussed at length by Theis et al. (2015) and Huszár (2015), and corroborated in empirical studies (Grover et al., 2018; Kim et al., 2022). Autoregressive models suffer an additional affliction: they have notoriously unstable dynamics during sample generation (Bengio et al., 2015; Lamb et al., 2016) due to their sequential sampling algorithm, which can cause errors to compound across time steps. Such errors usually cannot be corrected ex post facto due to the autoregressive structure of the model, and can substantially affect downstream steps, as we find that model likelihoods are highly sensitive to even the most minor of perturbations.

Score-based diffusion models (Song et al., 2020b; Ho et al., 2020) offer a different perspective on the data generation process. Even though sampling is also sequential, diffusion models are more robust to perturbations because, in essence, they are trained as denoising functions (Ho et al., 2020). Moreover, the update direction in each step is unconstrained (unlike token-wise autoregressive models, which can only update one token at a time, and only once), meaning the model can correct errors from previous steps. However, likelihood evaluations have no closed form, requiring ODE/SDE solvers and hundreds to thousands of calls to the underlying network, rendering the models incapable of being trained via MLE. Diffusion models also do not inherit any of the asymptotic guarantees of score matching (Hyvärinen & Dayan, 2005; Song et al., 2020a), even though they
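To make the noise conditional objective from the abstract concrete, the following is a minimal, purely illustrative sketch: data points are perturbed at a randomly drawn noise level, and the model's negative log-likelihood is evaluated conditional on that level. Here `toy_log_prob` is a hypothetical stand-in for an actual autoregressive density, and the discrete `sigmas` grid approximates the continuum of noise levels; neither is the paper's actual model or noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_conditional_nll(x, log_prob_fn, sigmas, rng):
    """Monte Carlo estimate of a noise conditional MLE objective:
    the average NLL of data perturbed at randomly sampled noise levels."""
    total = 0.0
    for xi in x:
        sigma = rng.choice(sigmas)                             # draw a noise level
        x_tilde = xi + sigma * rng.standard_normal(xi.shape)   # perturb the data point
        total += -log_prob_fn(x_tilde, sigma)                  # conditional NLL
    return total / len(x)

def toy_log_prob(x_tilde, sigma):
    # Hypothetical stand-in model: a diagonal Gaussian whose variance
    # widens with sigma, mimicking a density fit to perturbed data.
    var = 1.0 + sigma**2
    return float(np.sum(-0.5 * (x_tilde**2 / var + np.log(2 * np.pi * var))))

x = rng.standard_normal((128, 4))            # toy 4-dimensional dataset
sigmas = np.array([0.0, 0.1, 0.5, 1.0])      # grid approximating the continuum
nll = noise_conditional_nll(x, toy_log_prob, sigmas, rng)
print(f"noise conditional NLL per example: {nll:.3f}")
```

In an actual implementation, `log_prob_fn` would be the autoregressive model's exact log-likelihood (a product of per-token conditionals) with the noise level supplied as a conditioning input, and the objective above would be minimized over the model's parameters.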

