IDF++: ANALYZING AND IMPROVING INTEGER DISCRETE FLOWS FOR LOSSLESS COMPRESSION

Abstract

In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that invertible flows for discrete random variables are less flexible than their continuous counterparts. We prove that this claim does not hold for integer discrete flows, due to the embedding of data with finite support into the countably infinite integer lattice. Furthermore, we zoom in on the effect of gradient bias due to the straight-through estimator in integer discrete flows, and demonstrate that its influence is highly dependent on architecture choices and less prominent than previously thought. Finally, we show how different architecture modifications improve the performance of this model class for lossless compression, and that they also enable more efficient compression: a model with half the number of flow layers performs on par with or better than the original integer discrete flow model.

1. INTRODUCTION

Density estimation algorithms that minimize the cross entropy between a data distribution and a model distribution can be interpreted as lossless compression algorithms, because the cross entropy upper-bounds the data entropy. While autoregressive neural networks (Uria et al., 2014; Theis & Bethge, 2015; Oord et al., 2016; Salimans et al., 2017) and variational auto-encoders (Kingma & Welling, 2013; Rezende & Mohamed, 2015) have seen practical connections to lossless compression for some time, normalizing flows were only recently used for lossless compression. Most normalizing flow models are designed for real-valued data, which complicates an efficient connection with entropy coders, since these require discretized data. However, normalizing flows for real-valued data were recently connected to bits-back coding by Ho et al. (2019b), opening up the possibility of efficient dataset compression with high compression rates. Orthogonal to this, Tran et al. (2019) and Hoogeboom et al. (2019a) introduced normalizing flows for discrete random variables, and Hoogeboom et al. (2019a) demonstrated that integer discrete flows can be connected directly to entropy coders without the need for bits-back coding.

In this paper we aim to improve integer discrete flows for lossless compression. Recent literature has proposed several hypotheses on the weaknesses of this model class, which we investigate as potential directions for improving compression performance. More specifically, we start by discussing the claim by Papamakarios et al. (2019) on the flexibility of normalizing flows for discrete random variables, and we show that this limitation on flexibility does not apply to integer discrete flows. We then discuss the potential influence of gradient bias on the training of integer discrete flows, and demonstrate that other, less-biased gradient estimators do not improve final results. Furthermore, through a numerical analysis on a toy example we show that the straight-through gradient estimates for 8-bit data correlate well with finite-difference estimates of the gradient. We also demonstrate that the previously observed performance degradation as a function of the number of flow layers is highly dependent on the architecture of the coupling layers. Motivated by this last finding, we introduce several architecture changes that improve the performance of this model class on lossless image compression.
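To make the comparison between straight-through and finite-difference gradients concrete, the following sketch contrasts the two estimators on a toy quantized regression. This is our own minimal illustration, not the paper's actual experiment: the objective, data, and all function names are hypothetical. The straight-through estimator (STE) backpropagates through the rounding operation as if it were the identity; the finite-difference estimate probes the true (piecewise-constant) loss averaged over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def ste_round(z):
    """Round to integers. Under the straight-through estimator,
    this op is treated as the identity in the backward pass."""
    return np.round(z)

# Toy objective: quantize a scaled signal and match an integer target.
x = rng.uniform(-4.0, 4.0, size=1000)  # inputs
t = np.round(2.0 * x)                  # integer targets (optimum near w = 2)

def loss(w):
    return np.mean((ste_round(w * x) - t) ** 2)

def ste_grad(w):
    # Backward pass with round treated as identity:
    # dL/dw = mean(2 * (round(w*x) - t) * x)
    return np.mean(2.0 * (ste_round(w * x) - t) * x)

def finite_diff_grad(w, h=0.05):
    # Central finite difference; h must be large enough that the
    # perturbation crosses quantization boundaries for many inputs.
    return (loss(w + h) - loss(w - h)) / (2.0 * h)

w = 1.5  # both estimates should point toward the optimum at w = 2
print(ste_grad(w), finite_diff_grad(w))
```

Averaged over enough samples, the quantized loss behaves like its smooth envelope, which is why the two estimates tend to agree in sign and magnitude at moderate bit depths.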

2. RELATED WORK

Continuous Generative Models: Continuous generative flow-based models (Chen & Gopinath, 2001; Dinh et al., 2014; 2016) are attractive due to their tractable likelihood computation. Recently these models have demonstrated promising performance in modeling images (Ho et al., 2019a; Kingma & Dhariwal, 2018), audio (Kim et al., 2018), and video (Kumar et al., 2019). We refer to Papamakarios et al. (2019) for a recent comprehensive review of the field. By discretizing the continuous latent vectors of variational auto-encoders and flow-based models, efficient lossless compression can be achieved using bits-back coding (Hinton & Van Camp, 1993). Recent examples of such approaches are Local Bits-Back Coding with normalizing flows (Ho et al., 2019b) and variational auto-encoders with bits-back coding such as Bits-Back with ANS (Townsend et al., 2019b), Bit-Swap (Kingma et al., 2019), and HiLLoC (Townsend et al., 2019a). These methods achieve good performance when compressing a full dataset, such as the ImageNet test set, since the auxiliary bits needed for bits-back coding can be amortized across many samples. However, encoding a single image would require more bits than the original image itself (Ho et al., 2019b).

Learned Discrete Lossless Compression: Producing discrete codes allows entropy coders to be directly applied to single data instances. Mentzer et al. (2019) encode an image into a set of discrete multiscale latent vectors that can be stored efficiently. Fully autoregressive generative models condition unseen pixels directly on the previously observed pixel values and have achieved the best likelihood values compared to other models (Oord et al., 2016; Salimans et al., 2017). However, decoding with these models is impractically slow since the conditional distribution for each pixel has to be computed sequentially.
Recently, super-resolution networks were used for lossless compression (Cao et al., 2020) by storing a low resolution image in raw format and by encoding the corrections needed for lossless up-sampling to the full image resolution with a partial autoregressive model. Finally, Mentzer et al. (2020) first encode an image using an efficient lossy compression algorithm and store the residual using a generative model conditioned on the lossy image encoding.

Hand-designed Lossless Compression Codecs:

The popular PNG algorithm (Boutell & Lane, 1997) leverages a simple autoregressive model and the DEFLATE algorithm (Deutsch, 1996) for compression. WebP (Rabbat, 2010) uses larger patches for conditional compression coupled with a custom entropy coder. In its lossless mode, JPEG 2000 (Rabbani, 2002) transforms an image using wavelet transforms at multiple scales before encoding. Lastly, FLIF (Sneyers & Wuille, 2016) uses an adaptive entropy coder that selects the local context model using a per-image learned decision tree.

3. BACKGROUND: NORMALIZING FLOWS

In this section we briefly review normalizing flows for real-valued and discrete random variables. A normalizing flow consists of a sequence of invertible functions applied to a random variable $x$: $f_K \circ f_{K-1} \circ \dots \circ f_1(x)$, yielding random variables $y_K \leftarrow \dots \leftarrow y_1 \leftarrow y_0 = x$. First, consider a real-valued random variable $x \in \mathbb{R}^d$ with unknown distribution $p_x(x)$. Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be an invertible function such that $y = f(x)$ with $y \in \mathbb{R}^d$. If we impose a density $p_y(y)$ on $y$, the distribution $p_x(x)$ is obtained by marginalizing out $y$ from the joint distribution $p_{x,y}(x, y) = p_{x|y}(x|y)\, p_y(y)$:

$$p_x(x) = \int \delta\big(x - f^{-1}(y)\big)\, p_y(y)\, \mathrm{d}y = \int \delta(x - u)\, p_y(f(u)) \left| \det \frac{\partial f(u)}{\partial u} \right| \mathrm{d}u = p_y(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|, \quad (1)$$

where we used $p_{x|y}(x|y) = \delta(x - f^{-1}(y))$, with $\delta(x - x')$ the Dirac delta distribution, and we applied the change of variables $u = f^{-1}(y)$. Repeated application of (1) for a sequence of transformations then yields the log-probability

$$\ln p_x(x) = \ln p_{y_K}(y_K) + \sum_{k=1}^{K} \ln \left| \det \frac{\partial y_k}{\partial y_{k-1}} \right|. \quad (2)$$

By parameterizing the invertible functions with invertible neural networks and by choosing a tractable distribution $p_{y_K}(y_K)$, these models can be used to optimize the log-likelihood of $x$. When modeling discrete data with continuous flow models, dequantization noise must be added to the input data to ensure that a lower bound to the discrete log-likelihood is optimized (Uria et al., 2013; Theis et al., 2015; Ho et al., 2019a).
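The log-likelihood computation in equation (2) can be sketched numerically for a toy flow. The example below (our own illustration, not an IDF coupling layer) composes two elementwise affine transformations $y = s \odot x + b$, for which $\ln|\det \partial y / \partial x| = \sum_i \ln |s_i|$, and accumulates the log-determinants along with the base-density log-probability:

```python
import numpy as np

def log_standard_normal(y):
    """Log-density of a standard normal base distribution p_{y_K}."""
    return -0.5 * (y ** 2 + np.log(2 * np.pi)).sum(axis=-1)

# Two toy invertible layers f_k(x) = s_k * x + b_k (elementwise affine),
# each with log|det df_k/dx| = sum_i log|s_k[i]|.
layers = [
    (np.array([2.0, 0.5]), np.array([0.3, -0.1])),
    (np.array([1.5, 3.0]), np.array([0.0, 0.2])),
]

def log_px(x):
    """Eq. (2): ln p_x(x) = ln p_{y_K}(y_K) + sum_k ln|det dy_k/dy_{k-1}|."""
    y, log_det = x, 0.0
    for s, b in layers:
        y = s * y + b
        log_det += np.log(np.abs(s)).sum()
    return log_standard_normal(y) + log_det

x = np.array([[0.1, -0.4]])
print(log_px(x))
```

Because each layer is invertible and the log-determinants are tracked exactly, the resulting density on $x$ integrates to one, which can be checked by numerical integration over a grid.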

