LEARNING ACCURATE ENTROPY MODEL WITH GLOBAL REFERENCE FOR IMAGE COMPRESSION

Abstract

In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine a hyperprior with local context in the entropy estimation function, which greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression that effectively leverages both local and global context information, leading to an enhanced compression rate. The proposed method scans the decoded latents and finds the most relevant latent to assist the distribution estimation of the current latent. A by-product of this work is a GDN module with mean-shifting correction that further improves the performance. Experimental results demonstrate that the proposed model outperforms most state-of-the-art methods in rate-distortion performance.

1. INTRODUCTION

Image compression is a fundamental research topic in computer vision. The goal of image compression is to preserve the critical visual information of the image while reducing the bit-rate for storage or transmission. The state-of-the-art image compression standards, such as JPEG (Wallace, 1992), JPEG2000 (Rabbani & Joshi, 2002), HEVC/H.265 (Sullivan et al., 2012) and Versatile Video Coding (VVC) (Ohm & Sullivan, 2018), are carefully engineered and highly tuned to achieve better performance. Albeit widely deployed, the conventional human-designed codecs took decades of development to achieve today's impressive compression rates, and any further improvement is expected to be even more difficult. Inspired by the success of deep learning in many vision tasks, several pioneering works (Toderici et al., 2016; Agustsson et al., 2017; Theis et al., 2017; Ballé et al., 2017; Ballé et al., 2018; Mentzer et al., 2018; Lee et al., 2019; Minnen et al., 2018a) demonstrate that the image compression task can be effectively solved by deep learning too. This breakthrough allows us to use data-driven learning systems to design novel compression algorithms automatically. As a result, a majority of deep image compression (DIC) models are based on the autoencoder framework. In this framework, an encoder transforms pixels into a quantized latent representation suitable for compression, while a decoder is jointly optimized to transform the latent representation back into pixels. The latent representation can be losslessly compressed into a bitstream using entropy coding methods (Rissanen & Langdon, 1981). In entropy coding, the compression quality is controlled by the entropy estimation of the latent features generated by the encoder. It is therefore important to learn an accurate entropy model. To this end, several solutions have been considered.
With additional bits, some methods propose an entropy model conditioned on a hyperprior, using side information of local histograms over the latent representation (Minnen et al., 2018b) or a hierarchical learned prior (Ballé et al., 2018). Context-adaptive models (Minnen et al., 2018a; Lee et al., 2019) incorporate predictions from neighboring symbols to avoid storing the additional bits. While these methods improve the accuracy of the entropy models, they are unable to use global context information during compression, leading to suboptimal performance. In this work, we observe that global spatial redundancy remains in the latents, as shown in Figure 1. Motivated by this, we propose to build up global relevance throughout the latents. Inspired by recent reference-based Super-Resolution (SR) methods (Zheng et al., 2018; Yang et al., 2020), we empower the entropy model with global vision by incorporating a reference component. Unlike the super-resolution scenario, incorporating global reference information is non-trivial in deep image compression: during decoding the image is only partially available, which means much of the information is missing. Besides, our target is to reduce the bit rate and recover the image from the bitstream faithfully, rather than to inpaint a low-resolution image with vivid generated details. To address these challenges, in our proposed method a global reference module searches over the decoded latents to find the latents relevant to the target latent. The feature map of the relevant latent is then combined with the local context and the hyperprior to generate a more accurate entropy estimate. A key ingredient in the global reference ensemble step is that we consider not only the similarity between the relevant latent and the target but also a confidence score that measures the high-order statistics of the latent feature distribution.
The introduction of the confidence score enhances the robustness of the entropy model, especially for images with noisy backgrounds. We also found that the Generalized Divisive Normalization (GDN) widely used in image compression suffers from a mean-shifting problem: since the GDN densities are zero-mean by definition, mean removal is necessary to fit the density (Ballé et al., 2016b). We therefore propose an improved version of GDN, named GSDN (Generalized Subtractive and Divisive Normalization), to overcome this difficulty. We summarize our main contributions as follows:

• To the best of our knowledge, we are the first to introduce global reference into the entropy model for deep image compression. We develop a robust reference algorithm that ensembles local context, global reference and hyperprior in a novel architecture. When estimating the latent feature entropy, both the similarity score and the confidence score of the reference area are considered, making the model robust against noisy background signals.

• We propose a novel GSDN module that corrects the mean-shifting problem.

• Experiments show that our method outperforms the most advanced codecs available today on both the PSNR and MS-SSIM quality metrics. Our method saves 6.1% in bit rate compared to the context-adaptive deep models (Minnen et al., 2018b; Lee et al., 2019) and as much as 21.0% relative to BPG (Bellard, 2014).

The remainder of this work is organized as follows. In Section 2, we introduce the backbone of the end-to-end deep image compression network as well as the reference-based component for the entropy model. Section 3 presents the structure of our combined entropy model. The GSDN with mean-shifting correction is given in Section 4. We present experimental comparisons and visualizations in Section 5. Finally, we conclude this work with an open discussion in Section 6.

2. LEARNED IMAGE COMPRESSION

Learned image compression using deep neural networks has attracted considerable attention recently. The work of Toderici et al. (2016) first explored a recurrent architecture using an LSTM-based entropy model. A wide range of models (Ballé et al., 2017; Ballé et al., 2018; Mentzer et al., 2018; Minnen et al., 2018a; Lee et al., 2019; Hu et al., 2020; Cheng et al., 2020) used a CNN-based autoencoder with constrained entropy. A general learned image compression system consists of an encoder, a quantizer, a decoder, and an entropy model. An image x is transformed into a latent representation y via the encoder g_a(x), which is discretized by the quantizer Q(y) to form ŷ. Given the entropy model p_ŷ, the discretized value ŷ can be compressed into a bitstream using entropy coding techniques such as arithmetic coding (Rissanen & Langdon, 1981). The decoder g_s(ŷ) then forms the reconstructed image x̂ from the quantized latent representation ŷ, which is decompressed from the bitstream. The training goal for learned image compression is to optimize the trade-off between the estimated coding length of the bitstream and the quality of the reconstruction, which is a rate-distortion optimization problem:

L = R + λD = E_{x∼p_x}[−log₂ p_ŷ(Q(g_a(x)))] + λ E_{x∼p_x}[d(x, g_s(ŷ))]   (1)

where λ is the coefficient that controls the rate-distortion trade-off and p_x is the unknown distribution of natural images. The first term represents the estimated compression rate of the latent representation. The second term d(x, x̂) represents the distortion value under a given metric, such as mean squared error (MSE) or MS-SSIM (Wang et al., 2003). Entropy coding relies on an entropy model to estimate the prior probability of the latent representation. Ballé et al. (2017) propose a fully factorized prior for entropy estimation as shown in Figure 2(a), but the prior probability of the discrete latent representation is not adaptive to different images. As shown in Figure 2(b), Ballé et al.
(2018) model the latent representation as a zero-mean Gaussian distribution based on a spatial dependency with additional bits. Lee et al. (2019) and Minnen et al. (2018a) introduce an autoregressive component into the entropy model. Taking advantage of the high correlation of local dependency, context-adaptive models contribute to more accurate entropy estimation. However, since their context-adaptive entropy models only capture the spatial information of neighboring latents, redundant spatial information remains across the whole image. To further remove such redundancy, our method incorporates a reference-based model to capture global spatial dependency. Specifically for learned image compression, a generalized divisive normalization (GDN) (Ballé et al., 2016a) transform with optimized parameters has proven effective in Gaussianizing the local joint statistics of natural images. Unlike many other normalization methods whose parameters are typically fixed after training, GDN is spatially adaptive and therefore highly nonlinear. As the reference-based model calculates relevance over the latents, it is crucial to align the distribution of the latents. To better align the latents, the proposed GSDN incorporates a subtractive term into GDN. We also present an effective method of inverting it when decompressing the latent representation back to the image.
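To make the rate-distortion objective of Equation 1 concrete, the sketch below evaluates L = R + λD given per-latent bit estimates and an MSE distortion. The function name `rd_loss` and its argument layout are illustrative assumptions, not part of any released implementation.

```python
import numpy as np

def rd_loss(bits_per_latent, x, x_hat, lam):
    """Rate-distortion objective L = R + lambda * D.

    bits_per_latent: estimated -log2 p(y_hat) for each quantized latent
    x, x_hat: original and reconstructed images as float arrays
    lam: trade-off coefficient; a larger lam favors reconstruction quality
    """
    rate = np.sum(bits_per_latent)          # estimated bitstream length (bits)
    distortion = np.mean((x - x_hat) ** 2)  # MSE distortion d(x, x_hat)
    return rate + lam * distortion
```

In practice each model is trained for one fixed λ, which is why the experiments later train separate models per bit-rate range.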

[Figure 3 (diagram): the compression model with encoder E, quantizer Q and decoder D mapping x ↔ ŷ; a context model predicting (µ₁, σ₁); a reference model predicting (µ₂, σ₂); a hyper encoder/decoder (HE/HD) with a factorized entropy model predicting (µ₃, σ₃); parameter networks (PN); and arithmetic encoder/decoder (AE/AD).]

3. COMBINED LOCAL, GLOBAL AND HYPERPRIOR ENTROPY MODEL

The models we analyze in this paper build on the architecture introduced in Minnen et al. (2018a), which combined an autoregressive model with a hyperprior. Figure 3 provides a high-level overview of our approach. The compression model contains two main sub-networks. The first is the core autoencoder, which learns the transform and the inverse transform between image and latent representation. Q represents the quantization function. The gradient-based optimization in learned methods is hindered by quantization; here, we make use of a mixed approach that has proven efficient in Minnen & Singh (2020). The second sub-network is the combined entropy model, which is responsible for estimating a probabilistic model over the latents for entropy coding. The combined entropy model consists of a context model, a reference model, and a hyper-network (hyper encoder and hyper decoder). The three components are combined progressively, and three parameter networks generate the mean and scale parameters for a conditional Gaussian entropy model respectively. Following the work of Minnen et al. (2018a), we model each latent ŷ_i as a Gaussian with mean µ_i and standard deviation σ_i convolved with a unit uniform distribution:

p_ŷ(ŷ | ẑ, θ) = ∏_i ( N(µ_i, σ_i²) * U(−0.5, 0.5) )(ŷ_i)   (2)

where µ and σ are the predicted parameters of the entropy model, ẑ is the quantized hyper-latents, and θ denotes the entropy model parameters. The entropy model for the hyperprior is the same as in Ballé et al. (2018), which is a non-parametric, fully factorized density model. As the hyperprior is part of the compressed bitstream, we extend the rate of Equation 1 as follows:

R = E_{x∼p_x}[−log₂ p_ŷ(ŷ)] + E_{x∼p_x}[−log₂ p_ẑ(ẑ)]   (3)

The compressed latents and the compressed hyper-latents are both part of the bitstream. The reference-based SR methods (Zheng et al., 2018; Yang et al., 2020) adopt "patch match" to search for proper reference information.
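The Gaussian-convolved-with-uniform model of Equation 2 has a simple closed form: the probability of an integer-quantized latent is the Gaussian mass on the unit interval around it. A minimal sketch (the function names are ours, not from the paper's code):

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Standard Gaussian CDF evaluated at x for N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def latent_likelihood(y_hat, mu, sigma):
    """P(y_hat) under N(mu, sigma^2) convolved with U(-0.5, 0.5):
    the Gaussian probability mass on [y_hat - 0.5, y_hat + 0.5]."""
    return (gaussian_cdf(y_hat + 0.5, mu, sigma)
            - gaussian_cdf(y_hat - 0.5, mu, sigma))
```

The negative base-2 logarithm of this likelihood is the estimated code length used in the rate term of Equation 3.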
However, in the serial processing of image compression, the latent representation during decoding is often incomplete. We extend this search method by using a masked patch. Figure 4 illustrates how the relevance embedding module estimates similarity and fetches the relevant latents. When decoding the target latent, we use its neighboring latents (left and top) as a basis to compute similarities between the target latent and the previously decoded latents. In particular, the latents are unfolded into patches and then masked, denoted as q ∈ R^{(H×W) × (k×k×C)} (where H, W, k, C correspond to height, width, unfold kernel size and channels, respectively). We calculate the similarity matrix r ∈ R^{(H×W) × (H×W)} over the masked patches using cosine similarity:

r_{i,j} = ⟨ q_i / ‖q_i‖ , q_j / ‖q_j‖ ⟩   (4)

Note that only the decoded latents are visible, so the entries of the similarity matrix corresponding to not-yet-decoded positions are set to zero. We obtain the most relevant position for each latent as well as its similarity score. According to this position, we fetch the neighboring latents (left and top) as well as the center latent, which together are named the "relevant latents". We use a masked convolution as in Van den Oord et al. (2016) to transfer the relevant latents. To measure how likely a reference patch perfectly matches the target patch, Yang et al. (2020) propose a soft-attention module to transfer features using the similarity map S. However, we found that a similarity score alone is not sufficient to reflect the quality of a reference latent in image compression. For this reason, a confidence score is introduced to measure the texture complexity of the relevant latent. We use the context model to predict the Gaussian parameters (i.e., µ_1, σ_1) of the latents on their own: the latents ŷ are modeled as Gaussian with mean µ_1 and standard deviation σ_1, and their probabilities are calculated according to (µ_1, σ_1) as in Equation 2.
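The causal search of Equation 4 can be sketched as below. This is a simplified raster-order version in which position i may only reference positions decoded before it; the function name and the [P, D] patch layout (P positions, D = k·k·C features per masked patch) are our assumptions.

```python
import numpy as np

def best_reference(patches):
    """For each latent position (raster order), find the most similar
    previously-decoded position via cosine similarity.

    patches: [P, D] array, one masked k*k*C patch vector per position.
    Returns (ref_idx, score); ref_idx[i] = -1 when no decoded latent
    is available yet (the first position).
    """
    norms = np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8
    q = patches / norms
    r = q @ q.T                        # cosine similarity matrix [P, P]
    # Causality: target i may only reference positions j < i, so mask
    # the diagonal and everything after it in each row.
    r[np.triu_indices_from(r)] = -np.inf
    ref_idx = np.argmax(r, axis=1)
    score = np.max(r, axis=1)
    ref_idx[score == -np.inf] = -1     # first position has no reference
    return ref_idx, score
```

A full implementation would additionally unfold the 2D latent grid into patches and fetch the left/top neighbors of the winning position, as described above.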
As the reference model operates in the spatial dimension, the confidence map U is obtained by averaging these probabilities across channels. With these two quantities, a more relevant latent combination is enhanced while a less relevant one is suppressed. The similarity S and the confidence U are both 2D feature maps. Figure 5 shows the structure of our combined entropy model. For the context model, we transfer the latents (i.e., ŷ) with a masked convolution. For the reference model, we transfer the unfolded relevant latents with a masked convolution. We use 1×1 convolutions in the parameter networks. Local, global and hyperprior features are ensembled stage by stage, as are the predicted Gaussian parameters. The mean parameters are estimated by the context model first, and then updated by the global model and the hyperprior model. We use the Log-Sum-Exp trick to resolve underflow and overflow issues in the deviation parameters. The output of the global reference is further multiplied by the similarity S and the confidence U. The context model draws on the neighboring latents of the target latent to reduce local redundancy. From the perspective of the global context, the reference model makes further efforts to capture spatial dependency. As the first two models predict from the decoded latents, there remains uncertainty that cannot be eliminated by them alone; the hyperprior model learns to store the information needed to reduce this uncertainty. This progressive mechanism allows incremental accuracy in distribution estimation.
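A sketch of how the confidence map U could be computed from the context model's parameters, following the channel-averaging description above; the function name, array layout, and use of the Equation-2 likelihood are our assumptions.

```python
import numpy as np
from math import erf, sqrt

def confidence_map(y_hat, mu1, sigma1):
    """Confidence U: per-position average (over channels) of the
    probability the context model assigns to each decoded latent.
    High U means the local texture is well predicted, so its reference
    contribution is trusted more; noisy regions get low confidence.

    y_hat, mu1, sigma1: arrays of shape [C, H, W].
    Returns U of shape [H, W] with values in (0, 1).
    """
    cdf = lambda x: 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    upper = cdf((y_hat + 0.5 - mu1) / sigma1)   # Gaussian mass bounds of
    lower = cdf((y_hat - 0.5 - mu1) / sigma1)   # the quantization bin
    return np.mean(upper - lower, axis=0)       # average across channels
```

The reference features would then be scaled elementwise by both S and U before being fed to the parameter networks.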

4. GENERALIZED SUBTRACTIVE AND DIVISIVE NORMALIZATION

Virtually all traditional image and video compression codecs consist of several basic modules, i.e., transform, quantization, entropy coding and inverse transform. An effective transform for image compression maps the image to a compact and decorrelated latent representation. As part of a Gaussianizing transformation, a generalized divisive normalization (GDN) joint nonlinearity has proven effective at removing statistical dependencies in image data (Ballé et al., 2016b), and it shows an impressive capacity for learned image compression. We define a generalized subtractive and divisive normalization (GSDN) transform that incorporates a subtractive operation. Inspired by the zero-mean definition of the Gaussian density, an adaptive subtractive operation is applied before the divisive operation. In particular, we apply subtractive-divisive normalization after each convolution and subsampling operation in the encoder g_a (except the last convolution layer). We represent the ith channel at a spatial location as u_i. The normalization operation is defined by:

w_i = ( u_i − (ν_i + Σ_j τ_ij u_j) ) / ( β_i + Σ_j γ_ij u_j² )^{1/2}   (5)

The parameter set consists of two vectors (β and ν) and two matrices (γ and τ), for a total of 2 × (N + N²) parameters (where N is the number of channels of the input feature). The normalization operation shares parameters across the spatial dimensions. We invert the normalization operation in the decoder g_s based on the inversion of GDN introduced in Ballé et al. (2016a). We apply inverse GSDN (IGSDN) after each deconvolution and upsampling operation (except the last deconvolution layer), corresponding to the encoder. In the decoder, we represent the ith channel at a spatial location as ŵ_i.
For the inverse, subtraction is replaced by addition while division is replaced by multiplication:

û_i = ŵ_i · ( β_i + Σ_j γ_ij ŵ_j² )^{1/2} + ( ν_i + Σ_j τ_ij ŵ_j )   (6)
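Equations 5 and 6 at a single spatial location can be sketched as follows. As with IGDN, the inverse recomputes the statistics from the normalized features ŵ rather than from u, so it is the one-step approximate inversion described in the text, not an exact algebraic inverse; the function names are ours.

```python
import numpy as np

def gsdn(u, beta, gamma, nu, tau):
    """GSDN forward (Eq. 5): subtract an adaptive mean, then divide by
    an adaptive scale.  u, beta, nu: [N] vectors; gamma, tau: [N, N]."""
    mean = nu + tau @ u                       # adaptive subtractive term
    scale = np.sqrt(beta + gamma @ (u ** 2))  # adaptive divisive term
    return (u - mean) / scale

def igsdn(w, beta, gamma, nu, tau):
    """Approximate inverse (Eq. 6): multiply back the scale, then add
    the mean, both recomputed from the normalized features w."""
    u = w * np.sqrt(beta + gamma @ (w ** 2))
    return u + (nu + tau @ w)
```

When γ = τ = 0 and β = 1 the transform reduces to the identity, so the pair round-trips exactly in that degenerate case; in general the decoder learns parameters that make the approximate inverse effective.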

5. EXPERIMENTS

5.1. EXPERIMENTAL SETUP

Architecture

For the results in this paper, we did not make an effort to reduce the capacity (i.e., the number of channels and layers) of the neural networks to optimize computational complexity. The architecture of our approach extends the work of Minnen et al. (2018a) in two ways. First, the main autoencoder is extended by replacing GDN with the proposed GSDN (and IGDN with IGSDN in the decoder). Second, the entropy model is extended by incorporating a reference model. The three modules are combined progressively.

Training

The models were trained on color PNG images from the CVPR workshop CLIC training dataset (http://challenge.compression.cc/). The models were optimized using Adam (Kingma & Ba, 2014) with a batch size of 8 and patches of size 512 × 512 randomly extracted from the training dataset. Note that a large patch size is necessary for training the reference model. As our combined entropy model has three sets of predicted Gaussian parameters, we first trained the three modules with weights of 0.3 : 0.3 : 0.4 as a warm-up for 1000 epochs. After that, we trained the three modules with weights of 0.1 : 0.1 : 0.8, because the third output is the one used for entropy coding in practice. In the experiments, we trained models with different λ to evaluate the rate-distortion performance over various ranges of bit-rate.

Distortion measure. We optimized the networks using two different types of distortion terms, one with MSE and the other with MS-SSIM (Wang et al., 2003). For each distortion type, the average bits per pixel (BPP) and the distortion, PSNR and MS-SSIM, over the test set are measured for each model configuration.

Other codecs. For the standard codecs, we used BPG (Bellard, 2014) and JPEG (Wallace, 1992). For the learning-based codecs, we compared against state-of-the-art methods that combine spatial context with a hyperprior (Minnen et al., 2018a; Lee et al., 2019), which share a similar structure to our method.
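The two-stage loss weighting described above can be sketched as below; the helper name and the epoch-based switch are our assumptions about how the 0.3 : 0.3 : 0.4 warm-up and the 0.1 : 0.1 : 0.8 schedule might be wired together.

```python
def staged_rate_loss(rates, epoch, warmup_epochs=1000):
    """Weighted sum of the three stage rates (context-only, +reference,
    +hyperprior).  Warm-up weights train all three stages roughly
    equally; afterwards the third output, the one actually used for
    entropy coding, dominates the objective.

    rates: [R1, R2, R3] estimated bits from the three parameter networks.
    """
    weights = (0.3, 0.3, 0.4) if epoch < warmup_epochs else (0.1, 0.1, 0.8)
    return sum(w * r for w, r in zip(weights, rates))
```

The full training loss would add λ times the distortion term of Equation 1 to this weighted rate.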

5.2. RATE DISTORTION PERFORMANCE

We evaluate the effects of global reference and GSDN in learned image compression. Figure 6 shows RD curves over the publicly available Kodak dataset (Kodak, 1993) using peak signal-to-noise ratio (PSNR) and MS-SSIM as the image quality metrics. The RD graphs compare our full model (Entropy Model with Reference + GSDN) to existing image codecs. In terms of both PSNR and MS-SSIM, our model performs better than state-of-the-art learning-based methods as well as standard codecs. Figure 7 compares different versions of our models. In particular, it plots the rate savings of each model relative to the curve for BPG, a visualization that is more readable than a standard RD graph. It shows that the three components (i.e., local context, global reference and hyperprior) yield progressive improvement. In particular, adding global reference provides a rate saving of 5.3% over the context-only model at low bit rates. Introducing the confidence U further improves the performance of the reference model. Our full model, which replaces GDN with GSDN, provides an additional rate saving of about 2.0% over the proposed entropy model.

5.3. VISUAL RESULTS OF REFERENCE-BASED ENTROPY MODEL AND GSDN

Figure 8 shows the results of the relevance embedding module. The target region (indicated by purple) and its relevant region (indicated by yellow) are marked with the same numbers. As the relevance is calculated on the latents, we map the positions back to the RGB image. Each region box indicates only the position of the target latent (or relevant latent) and does not represent a receptive field. The relevance results explain the bit savings achieved by combining global reference.

6. DISCUSSION

Based on previous context-adaptive methods (Minnen et al., 2018a; Lee et al., 2019), we have introduced a new entropy model for learned image compression. By incorporating global reference, we have developed a more accurate distribution estimation for the latent representation. Ideally, our combined entropy model effectively leverages both local and global context information, which yields enhanced compression performance. The positive results from global reference are somewhat surprising. We showed in Figure 7 that the combined entropy model provides a progressive improvement in rate-distortion performance without increasing the complexity of the model. The global reference model scans the decoded latents and finds the most relevant latent to assist the distribution estimation of the target latent. Our reference model is inspired by recent work on reference-based super-resolution (Zhang et al., 2019; Yang et al., 2020). We extend this reference-based module in three ways to adapt it to image compression. First, we extend it to a single-image reference module. Second, we incorporate the reference module into the entropy model; the idea is to avoid disturbing the highly compact latent representation. Third, a confidence variable is introduced to enable adaptive reference. Intuitively, we can see how the local context and global reference are complementary, as shown in Figure 9. The improvement from the reference model also implies that current learned image compression does not model spatial redundancy ideally. The proposed reference model develops relevance across the latents of a single image. Referencing within a single image limits its benefit: the upper part of the latents has fewer decoded latents to reference. An alternative direction for future research may be to extend the reference model to multi-image compression.
We also plan to investigate combining video compression with the reference model to see whether the two approaches are complementary.

A APPENDIX

A.1 The output of the last layer in the encoder corresponds to the latents, and the output of the last layer in the hyper-encoder corresponds to the hyper-latents. The output of the last layer in the decoder corresponds to the generated RGB image. The three parameter networks share a similar architecture; the only difference is the number of input channels of the first layer in each parameter network. The output of each parameter network must have exactly twice as many channels as the latents. This constraint arises because the entropy model predicts two values, the mean and deviation of a Gaussian distribution, for each latent.

A.3 COMPUTATIONAL COMPLEXITY

Our main goal was to optimize compression performance. To make a fair comparison with previous methods, we have taken care to match model capacities between the Context+Hyperprior method (Minnen et al., 2018b) and the Context-Adaptive method (Lee et al., 2019). We did not choose a number of filters that would limit the capacities of the encoder, decoder and entropy model. Compared to the Context+Hyperprior method (Minnen et al., 2018b), the computation added by GSDN and global reference is about 10% when processing the 512×768 Kodak images. The complexity of the search module is O(N²) in the size of the latents, while the complexities of the other modules are O(N). Although not the main goal of our paper, computational complexity is crucial for applying deep learning methods in industry. We add two tables in the appendix describing the computational complexity and FLOPs of our method. Notably, the complexity of the reference model grows faster with image size than that of the other modules, because the search module computes a similarity matrix over all the latents. We use a naive implementation of the search module, so there is room to improve the global reference model, and we are interested in researching this further. We have not yet optimized our compression method for computational complexity.
Rather, we chose the number of filters high enough that model capacity does not limit the reference model: we simply set the model larger than necessary and let it determine the number of channels that yield the best performance. The context-only baseline relies solely on an autoregressive process with a local context to predict the Gaussian parameters. The benefit of this approach is that no additional bits are added to the bitstream; the downside is that it conditions predictions only on neighboring latents.



Figure 1: Global spatial redundancy in the image. For standard codecs and previous learned codecs, non-local relevant patches (marked by yellow and blue) would consume equal bit rates.

Figure 2: Operational diagrams of learned compression models (a)(b)(c) and the proposed Reference-based Entropy Model (d).


Figure 4: A masked sliding patch searches over all the decoded latents (tan area). The relevant latents are fetched and processed with a masked convolution.

Figure 6: Rate-distortion curves aggregated over the Kodak dataset. The left plot shows peak signal-to-noise ratios as a function of bit rate (10 log₁₀(255²/d), with d representing mean squared error); the right plot shows MS-SSIM values converted to decibels (−10 log₁₀(1 − d), where d is the MS-SSIM value in the range between zero and one). On both metrics, our full model consistently outperforms standard codecs and the state-of-the-art learned models.

Figure 7: Each curve shows the rate savings at different PSNR quality levels relative to BPG. Our full model outperforms BPG by 21% at low bit rates.

Figure 8: Examples of target region (indicated by purple) and its relevant region (indicated by yellow).

Figure 9 visualizes the internal mechanisms of different entropy model variants. Three variants are shown: local context only (first row), combined global reference with local context (second row), and the full entropy model (third row). Intuitively, we can see how these components are complementary.

Figure 11: Histograms of the latent representation by GDN-based model and GSDN-based model. Each plot corresponds to one channel of the latents over 24 Kodak images. Four channels with highest entropy are visualized.

Figure 15: At similar bit rates, our combined method provides the highest visual quality on the Kodak 21 image. BPG shows more "classical" compression artifacts, e.g., ringing around the edge of the lighthouse.

(a) Ours (bpp=0.1147, PSNR=33.94); (b) BPG (bpp=0.1289, PSNR=33.35); (c) JPEG (bpp=0.1591, PSNR=22.53).

Figure 16: At similar bit rates, our combined method provides the highest visual quality on the Kodak 23 image. Note that the BPG reconstruction has some ringing and geometric artifacts (e.g., at the top of the red parrot's head).


RESULTS ON THE CLIC VALIDATION DATASET

Figure 12: Performance evaluation on the CLIC validation dataset. Our method performs very well when optimized for MSE or MS-SSIM. Each point on the RD curves is calculated by averaging the PSNR (or MS-SSIM) and bit rate over the 102 images from the CLIC validation dataset (http://challenge.compression.cc/).

Each row corresponds to a layer of our generalized model. Convolutional layers are specified with the "Conv" prefix followed by the kernel size, number of output channels and downsampling stride (e.g., the first layer of the encoder uses 5×5 kernels with 192 output channels and a stride of 2). The "Deconv" prefix corresponds to upsampled convolutions, while "Masked" corresponds to masked convolution as in Van den Oord et al. (2016). GSDN stands for generalized subtractive and divisive normalization, and IGSDN is inverse GSDN. The three parameter networks share a similar architecture.

GFLOPs of each module for the proposed method and the reproduced method (Minnen et al., 2018a). The size of the test image is 512×768.

GFLOPs of the entropy model and the proportion attributable to the reference model for various image sizes.

