NEURAL IMAGE COMPRESSION WITH A DIFFUSION-BASED DECODER

Abstract

Diffusion probabilistic models have recently achieved remarkable success in generating high-quality image and video data. In this work, we build on this class of generative models and introduce a method for lossy compression of high-resolution images. The resulting codec, which we call the DIffusion-based Residual Augmentation Codec (DIRAC), is the first neural codec to allow smooth traversal of the rate-distortion-perception tradeoff at test time, while achieving perceptual quality competitive with GAN-based methods. Furthermore, while sampling from diffusion probabilistic models is notoriously expensive, we show that in the compression setting the number of sampling steps can be drastically reduced.

1. INTRODUCTION

As a new family of generative approaches, denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015) have recently shown remarkable performance in generating high-resolution images with high perceptual quality. For example, they power large text-to-image models such as DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022b), which can produce realistic high-resolution images from arbitrary text prompts. Likewise, diffusion models have demonstrated impressive results on image-to-image tasks such as super-resolution (Saharia et al., 2022c; Ho et al., 2022a), deblurring (Whang et al., 2022), and inpainting (Saharia et al., 2022a), in many cases outperforming generative adversarial networks (GANs) (Dhariwal & Nichol, 2021). Our goal in this work is to leverage these capabilities in the context of learned compression.

Neural codecs, which learn to compress from example data, are typically trained to minimize both the distortion between an input and its reconstruction and the bitrate used to transmit the data (Theis et al., 2017). However, optimizing for distortion alone can yield blurry reconstructions, and recent work has highlighted the importance of perceptual quality. In particular, Mentzer et al. (2020) demonstrated that a GAN-based image codec can produce reconstructions that look more realistic to a human observer than those of a distortion-optimized codec, at the cost of some fidelity to the original. This tradeoff between perceptual quality and fidelity is fundamental (Blau & Michaeli, 2019), and finding a good operating point is nontrivial and likely application-dependent.

Our approach, called the DIffusion-based Residual Augmentation Codec (DIRAC), uses a base codec to produce an initial reconstruction with minimal distortion, then enhances its perceptual quality using a denoising diffusion probabilistic model.
By varying the number of sampling steps, the DDPM can smoothly interpolate between a high-fidelity output and one with high perceptual quality. Combined with a multi-rate base codec, this enables a user to navigate the rate-distortion and distortion-perception axes independently using separate control parameters. Moreover, because a high-fidelity initial prediction is made first, DDPM sampling can be stopped at any point, allowing the codec to operate more efficiently. We show that we can alter the sampling schedule at test time, and that we obtain strong performance with 20 sampling steps or fewer.
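The decoding procedure above can be sketched as follows. This is a minimal toy illustration, not the actual DIRAC implementation: `base_decode` and `denoise_step` are hypothetical stand-ins for the learned base codec and the DDPM update, and the arithmetic inside them is a placeholder chosen only to show the control flow (a fidelity-first reconstruction that can be refined for any number of steps and stopped early).

```python
import numpy as np

def base_decode(latent):
    # Hypothetical stand-in for the base codec's high-fidelity
    # reconstruction; the real model is a learned neural codec.
    return latent.copy()

def denoise_step(x, step, total_steps):
    # Hypothetical stand-in for one DDPM refinement step; a real
    # model would predict and remove noise with a neural network.
    # Here: a toy update that nudges the image toward a target.
    return x + 0.1 * (1.0 - x)

def dirac_decode(latent, num_steps):
    # num_steps = 0 returns the high-fidelity base reconstruction;
    # larger values trade fidelity for perceptual quality.
    # Because the base prediction comes first, the loop can be
    # stopped after any step.
    x = base_decode(latent)
    for t in range(num_steps):
        x = denoise_step(x, t, num_steps)
    return x

latent = np.zeros((4, 4))
fidelity_first = dirac_decode(latent, num_steps=0)   # pure base codec output
perceptual = dirac_decode(latent, num_steps=20)      # 20 refinement steps
```

The key design point is that the number of refinement steps is a test-time parameter of the decoder, so the distortion-perception operating point is chosen by the user rather than fixed at training time.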

Ideally, one selects this operating point at test time. Although adaptive rate control is common, only the work of Iwai et al. (2021) allows distortion and perceptual quality to be balanced dynamically. To the best of our knowledge, ours is the first neural codec that can navigate the full rate-distortion-perception tradeoff at test time. An earlier diffusion-based codec was proposed by Theis et al. (2022), but its encoding scheme is prohibitively expensive. Concurrent work by Yang & Mandt (2022) demonstrated a more practical diffusion-based codec, but we show that different design choices allow us to outperform their model by a large margin.

