LEARNING ACCURATE ENTROPY MODEL WITH GLOBAL REFERENCE FOR IMAGE COMPRESSION

Abstract

In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of the deep image encodings. Existing methods combine a hyperprior with local context in the entropy estimation function, which greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression that effectively leverages both local and global context information, leading to an enhanced compression rate. The proposed method scans the decoded latents and finds the most relevant latent to assist the distribution estimation of the current latent. A by-product of this work is a GDN variant that corrects a mean-shifting problem and further improves performance. Experimental results demonstrate that the proposed model outperforms most state-of-the-art methods in rate-distortion performance.

1. INTRODUCTION

Image compression is a fundamental research topic in computer vision. The goal of image compression is to preserve the critical visual information of the image while reducing the bit rate for storage or transmission. The state-of-the-art image compression standards, such as JPEG (Wallace, 1992), JPEG2000 (Rabbani & Joshi, 2002), HEVC/H.265 (Sullivan et al., 2012) and Versatile Video Coding (VVC) (Ohm & Sullivan, 2018), are carefully engineered and highly tuned to achieve strong performance. Albeit widely deployed, these conventional human-designed codecs took decades of development to reach today's impressive compression rates, and any further improvement is expected to be even more difficult. Inspired by the success of deep learning in many vision tasks, several pioneering works (Toderici et al., 2016; Agustsson et al., 2017; Theis et al., 2017; Ballé et al., 2017; Ballé et al., 2018; Mentzer et al., 2018; Lee et al., 2019; Minnen et al., 2018a) demonstrate that the image compression task can be effectively solved by deep learning as well. This breakthrough allows us to use data-driven learning systems to design novel compression algorithms automatically. As a result, the majority of deep image compression (DIC) models are based on the autoencoder framework. In this framework, an encoder transforms pixels into a quantized latent representation suitable for compression, while a decoder is jointly optimized to transform the latent representation back into pixels. The latent representation can be losslessly compressed into a bitstream by an entropy coding method (Rissanen & Langdon, 1981). In entropy coding, the compression rate is governed by the entropy estimate of the latent features generated by the encoder; it is therefore important to learn an accurate entropy model. To this end, several solutions have been considered.
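To make the link between entropy estimation and bit rate concrete, the sketch below estimates the bits needed to entropy-code quantized latents under a discretized Gaussian prior, in the spirit of the hyperprior models cited above. The function names and the toy data are illustrative rather than taken from any particular model; the point is that a prior better matched to the latent statistics yields a lower estimated bit rate.

```python
import math
import numpy as np

def _phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bits_for_symbol(y, mu, sigma):
    """Bits to entropy-code one quantized (integer) symbol y under a
    Gaussian prior N(mu, sigma^2): -log2 of the probability mass that
    the prior puts on the unit-width bin [y - 0.5, y + 0.5]."""
    p = _phi((y + 0.5 - mu) / sigma) - _phi((y - 0.5 - mu) / sigma)
    return -math.log2(max(p, 1e-12))

# Toy quantized latents, roughly centered at zero with spread ~2.
y_hat = np.array([0, 1, -1, 2, 0, -2, 1, 0, 3, -1, 0, 1, -3, 2, 0, -1])

# The better the entropy model matches the true latent statistics,
# the fewer bits the arithmetic coder needs.
accurate = sum(bits_for_symbol(v, 0.0, 2.0) for v in y_hat)
vague = sum(bits_for_symbol(v, 0.0, 8.0) for v in y_hat)
print(f"well-matched prior: {accurate:.1f} bits, vague prior: {vague:.1f} bits")
```

Here the actual arithmetic coder is omitted; in practice the estimated distribution parameters are fed to a range coder that achieves close to this bit count.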
With additional bits, some methods condition the entropy model on a hyperprior, using side information such as local histograms over the latent representation (Minnen et al., 2018b) or a hierarchical learned prior (Ballé et al., 2018). Context-adaptive models (Minnen et al., 2018a; Lee et al., 2019) incorporate predictions from neighboring symbols to avoid storing the additional bits. While these methods improve the accuracy of the entropy models, they are unable to use global context information during compression, leading to suboptimal performance.

In this work, we observe that global spatial redundancy remains in the latents, as shown in Figure 1. Motivated by this, we propose to build up global relevance throughout the latents. Inspired by recent reference-based Super-Resolution (SR) methods (Zheng et al., 2018; Yang et al., 2020), we empower the entropy model with global vision by incorporating a reference component. Unlike the super-resolution scenario, incorporating global reference information is non-trivial in deep image compression: during decoding the image is only partially available, so much of the information needed for matching is missing. Besides, our target is to reduce the bit rate and recover the image from the bitstream faithfully, rather than inpainting a low-resolution image with vivid generated details. To address these challenges, our proposed method employs a global reference module that searches over the decoded latents to find those relevant to the target latent. The feature map of the relevant latent is then combined with the local context and the hyperprior to generate a more accurate entropy estimate. A key ingredient in the global reference ensemble step is that we consider not only the similarity between the reference and the target but also a confidence score that measures high-order statistics of the latent feature distribution.
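The search-and-weight step can be sketched as follows. Cosine similarity stands in for the learned matching function, and the variance-based confidence term is only an illustrative proxy for the learned high-order statistic, not the actual formulation; all names in the sketch are hypothetical.

```python
import numpy as np

def global_reference(target, decoded, eps=1e-8):
    """Pick the most relevant previously decoded latent for `target`.

    decoded: (N, C) latent vectors already available to the decoder.
    target:  (C,) causal feature of the latent currently being coded.
    Returns the index of the best reference and that reference scaled
    by its combined similarity * confidence score."""
    t = target / (np.linalg.norm(target) + eps)
    d = decoded / (np.linalg.norm(decoded, axis=1, keepdims=True) + eps)
    sim = d @ t                               # cosine similarity to each candidate
    conf = 1.0 / (1.0 + decoded.var(axis=1))  # illustrative confidence proxy
    score = sim * conf                        # ensemble weighting
    idx = int(np.argmax(score))
    return idx, score[idx] * decoded[idx]

# Three decoded candidates: a mismatch, a noisy partial match, and a
# near-duplicate of the target patch.
decoded = np.array([[1., 0., 1., 0.],
                    [2., 1., 0., 1.],
                    [0., 1., 2., 3.]])
target = np.array([0., 1., 2., 3.2])
idx, ref = global_reference(target, decoded)
print(idx)  # the near-duplicate candidate is selected
```

In the actual model the scan is restricted to latents already decoded (causality is preserved), and the returned reference feature is fused with the local context and hyperprior features before the distribution parameters are predicted.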
The introduction of the confidence score enhances the robustness of the entropy model, especially for images with noisy backgrounds. We also find that the Generalized Divisive Normalization (GDN) widely used in image compression suffers from a mean-shifting problem: since the GDN densities are zero-mean by definition, mean removal is necessary to fit the density (Ballé et al., 2016b). We therefore propose an improved version of GDN, named GSDN (Generalized Subtractive and Divisive Normalization), to overcome this difficulty. We summarize our main contributions as follows:

• To the best of our knowledge, we are the first to introduce global reference into the entropy model for deep image compression. We develop a robust reference algorithm that ensembles local context, global reference and hyperprior in a novel architecture. When estimating the latent feature entropy, both the similarity score and the confidence score of the reference area are considered to counteract noisy background signals.

• We propose a novel GSDN module that corrects the mean-shifting problem.

• Experiments show that our method outperforms the most advanced codecs available today on both PSNR and MS-SSIM quality metrics. Our method saves 6.1% of the bit rate compared to the context-adaptive deep models (Minnen et al., 2018a; Lee et al., 2019) and as much as 21.0% relative to BPG (Bellard, 2014).
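The contrast between GDN and a subtractive variant can be illustrated with the following sketch. The per-channel shift `mu` is a hypothetical stand-in for learned subtractive parameters, and the exact GSDN parameterization may differ; the sketch only shows why removing the mean before the divisive step restores the zero-mean assumption of the density model.

```python
import numpy as np

def gdn(x, beta, gamma):
    """Plain GDN across channels: y_i = x_i / sqrt(beta_i + sum_j gamma_ij x_j^2).
    x has shape (C, H, W); beta is (C,), gamma is (C, C)."""
    norm = np.sqrt(beta[:, None, None] + np.tensordot(gamma, x * x, axes=1))
    return x / norm

def gsdn(x, beta, gamma, mu):
    """Sketch of GSDN: subtract a per-channel shift `mu` before the
    divisive step so the normalized response is centered, matching the
    zero-mean assumption of the GDN density model."""
    xc = x - mu[:, None, None]
    norm = np.sqrt(beta[:, None, None] + np.tensordot(gamma, xc * xc, axes=1))
    return xc / norm

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.normal(size=(C, H, W)) + 3.0   # activations with a large mean shift
beta, gamma = np.ones(C), 0.1 * np.eye(C)
mu = x.mean(axis=(1, 2))               # oracle shift, just for the demo
print(abs(gdn(x, beta, gamma).mean()), abs(gsdn(x, beta, gamma, mu).mean()))
```

In a real codec both the shift and the divisive parameters would be learned end-to-end, with a matching inverse operation in the synthesis transform.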



Figure 1: Global spatial redundancy in the image. For standard codecs and previous learned codecs, non-local relevant patches (marked by yellow and blue) would consume equal bit rates.

