NEURAL IMAGE COMPRESSION WITH A DIFFUSION-BASED DECODER

Abstract

Diffusion probabilistic models have recently achieved remarkable success in generating high-quality image and video data. In this work, we build on this class of generative models and introduce a method for lossy compression of high-resolution images. The resulting codec, which we call the DIffusion-based Residual Augmentation Codec (DIRAC), is the first neural codec to allow smooth traversal of the rate-distortion-perception tradeoff at test time, while remaining competitive with GAN-based methods in perceptual quality. Furthermore, while sampling from diffusion probabilistic models is notoriously expensive, we show that in the compression setting the number of steps can be drastically reduced.

1. INTRODUCTION

As a new set of generative approaches, denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015) have recently shown incredible performance in the generation of high-resolution images with high perceptual quality. For example, they have powered large text-to-image models such as DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022b), which are capable of producing realistic high-resolution images based on arbitrary text prompts. Likewise, diffusion models have demonstrated impressive results on image-to-image tasks such as super-resolution (Saharia et al., 2022c; Ho et al., 2022a), deblurring (Whang et al., 2022) or inpainting (Saharia et al., 2022a), in many cases outperforming generative adversarial networks (GANs) (Dhariwal & Nichol, 2021). Our goal in this work is to leverage these capabilities in the context of learned compression.

Neural codecs, which learn to compress from example data, are typically trained to minimize distortion between an input and a reconstruction, as well as the bitrate used to transmit the data (Theis et al., 2017). However, optimizing for distortion may result in blurry reconstructions, and recent work has identified the importance of perceptual quality. In particular, Mentzer et al. (2020) demonstrated that a GAN-based image codec is able to output reconstructions that look more realistic to a human observer than those of a distortion-optimized codec, at the cost of some fidelity to the original. This tradeoff between perceptual quality and fidelity is a fundamental one (Blau & Michaeli, 2019), and finding a good operating point is not trivial and likely application-dependent. Our approach, called the DIffusion-based Residual Augmentation Codec (DIRAC), uses a base codec to produce an initial reconstruction with minimal distortion, then enhances its perceptual quality using a denoising diffusion probabilistic model.
By varying the number of sampling steps, the DDPM can smoothly interpolate between a high-fidelity output and one with high perceptual quality. Combined with a multi-rate base codec, this enables a user to navigate the axes of rate-distortion and distortion-perception independently using separate control parameters. Moreover, because a high-fidelity initial prediction is made first, DDPM sampling can be stopped at any point, allowing the codec to operate more efficiently. We show that we can alter the sampling schedule at test time, and that we obtain strong performance with 20 sampling steps or fewer. In summary, this paper makes the following contributions:

1. We demonstrate one of the first practical diffusion-based codecs for high-resolution image compression, with competitive performance in both distortion and perceptual quality.
2. We show that this codec enables fine dynamic control over all independent axes of the rate-distortion-perception tradeoff at test time. To the best of our knowledge, this is the first neural codec with this ability.
3. Although DDPMs are notoriously expensive to sample from, we show that our design choices enable a substantial speedup, allowing sampling in 20 steps or fewer.
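The decoding loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names (`dirac_decode`, `denoiser`) are hypothetical, and the update rule is a simplified re-noising step. The key property it shows is that `num_steps = 0` returns the high-fidelity base reconstruction unchanged, while more steps apply more diffusion-based enhancement.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirac_decode(base_recon, denoiser, alpha_bars, num_steps):
    # Start from the high-fidelity base reconstruction; num_steps = 0
    # returns it unchanged, larger values trade fidelity for realism.
    x = base_recon
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t)  # predicted enhanced image at step t
        if t == 0:
            x = x0_hat           # final, noise-free output
        else:
            eps = rng.standard_normal(x.shape)
            a_bar = alpha_bars[t - 1]
            # Re-noise the prediction to the previous step's noise level.
            x = np.sqrt(a_bar) * x0_hat + np.sqrt(1.0 - a_bar) * eps
    return x

# Toy usage with a hypothetical identity "denoiser".
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
y = dirac_decode(np.zeros((8, 8)), lambda x, t: x, alpha_bars, num_steps=5)
print(y.shape)  # (8, 8)
```

Because each loop iteration produces a full image estimate, sampling can be truncated after any step, which is what makes the distortion-perception interpolation a test-time control.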

2. METHOD

2.1 DENOISING DIFFUSION PROBABILISTIC MODELS

Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are latent variable models in which the latent variables $x_1, \ldots, x_T$ are defined as a $T$-step Markov chain with Gaussian transitions. Through this Markov chain, the forward process of DDPMs gradually corrupts the original data $x_0$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\big). \quad (1)$$

Here, $q(x_t \mid x_{t-1})$ is a Gaussian distribution that models step $x_t$ conditioned on the previous step $x_{t-1}$. It is also possible to further condition these transitions on $x_0$ (Song et al., 2020). The variances $\beta_t$ are typically chosen empirically and referred to as the noise schedule, although they can be learned as well (Dhariwal & Nichol, 2021).

The key idea of DDPMs is that, if the corrupted data at step $T$ follows a distribution which permits efficient sampling, we can construct a generative model by reversing the forward process. For example, the chain defined by $q(x_t \mid x_{t-1})$ in Eq. (1) converges to a standard normal distribution for appropriately chosen $\beta_t$. This can be easily seen in the closed-form expression for Eq. (1):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\epsilon; 0, I),$$

where we define $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. DDPMs are therefore generative models that learn to recover the original data by modeling the reverse transitions:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big). \quad (2)$$

Here $p_\theta(x_{t-1} \mid x_t)$, parametrized by weights $\theta$, models a Gaussian distribution that reverses state $x_t$ back to $x_{t-1}$, and the base distribution is defined as the standard normal $p(x_T) := \mathcal{N}(x_T; 0, I)$. The variance $\Sigma_\theta$ is usually chosen to be $\sigma_t^2 I$, and can be learned or chosen empirically. Predictions from the DDPM can then be generated by iteratively applying the model and following the transitions $p_\theta(x_{t-1} \mid x_t)$.
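The forward process above admits an efficient one-shot sampler: $x_t$ can be drawn directly from $x_0$ without simulating all $t$ intermediate steps. A minimal sketch, assuming a linear noise schedule (the schedule values here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly increasing noise schedule beta_t, a common empirical choice.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{s<=t} alpha_s

def forward_sample(x0, t):
    # Closed form of Eq. (1):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# For this schedule alpha_bar_T is nearly zero, so x_T is close to N(0, I)
# regardless of x_0, which makes the base distribution p(x_T) tractable.
print(alpha_bars[-1] < 1e-3)  # True
```

This closed form is also what makes training cheap: each training example only needs a single corrupted sample $x_t$ at a randomly chosen $t$.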
We give more details on the sampling procedure in Section 2.3 and Appendix A.5.

The training objective can then be constructed in the context of variational inference, by approximating the posterior distribution of the latent variables $p_\theta(x_{1:T} \mid x_0)$ with the forward process $q(x_{1:T} \mid x_0)$. Using Bayes' rule to obtain $q(x_{t-1} \mid x_t, x_0)$, maximizing the evidence lower bound (ELBO) on $p_\theta(x_0)$ is then equivalent to minimizing a sum of $T$ Kullback-Leibler (KL) divergences. It can be shown that this loss can be rewritten as a simple minimization between the true data and a denoising prediction (more details in Appendix A.1):

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t, x_0, \epsilon}\big[ w_t \| x_0 - g_\theta(x_t, t) \|^2 \big]. \quad (3)$$

Here $g_\theta(x_t, t)$ is a model that directly predicts $x_0$ from $x_t$, and $w_t$ is a weighting factor. Note that this integrates $t$ into the expectation; while the full loss should sum over all $t$, it is common to approximate it by sampling $t$ uniformly at random during training.
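A Monte-Carlo estimate of the objective in Eq. (3) can be sketched as follows. This is an illustrative implementation with hypothetical names (`ddpm_loss`, `g_theta`), using uniform weighting $w_t = 1$ by default:

```python
import numpy as np

rng = np.random.default_rng(1)

def ddpm_loss(g_theta, x0, alpha_bars, w=None):
    # One-sample estimate of Eq. (3): draw t uniformly, form x_t via the
    # closed-form forward process, and regress g_theta(x_t, t) onto x_0.
    T = len(alpha_bars)
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    w_t = 1.0 if w is None else w[t]
    return w_t * np.mean((x0 - g_theta(x_t, t)) ** 2)

# Toy usage: the identity is not a useful denoiser, but it shows the API.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
x0 = rng.standard_normal(16)
loss = ddpm_loss(lambda x_t, t: x_t, x0, alpha_bars)
print(loss >= 0.0)  # True
```

In practice $g_\theta$ is a neural network and the loss is averaged over a minibatch, with one random $t$ per example.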



Ideally, one selects this operating point at test time. Although adaptive rate control is commonly used, only the work of Iwai et al. (2021) allows balancing distortion and perceptual quality dynamically. To the best of our knowledge, our work proposes the first neural codec that can navigate the full rate-distortion-perception tradeoff at test time. An earlier diffusion-based codec was proposed by Theis et al. (2022), but their encoding scheme is prohibitively expensive. Concurrent work by Yang & Mandt (2022) demonstrated a more practical diffusion-based codec, but we show that different design choices allow us to outperform their model by a large margin.

