LVQ-VAE: END-TO-END HYPERPRIOR-BASED VARIATIONAL IMAGE COMPRESSION WITH LATTICE VECTOR QUANTIZATION

Abstract

Image compression has become an increasingly important research topic. In recent years, learning-based methods have been studied extensively, and variational autoencoder (VAE)-based methods using a hyperprior-based context-adaptive entropy model have been reported to be comparable to the latest video coding standard, H.266/VVC, in terms of RD performance. We believe there is room for improvement in the quantization of latent features by adopting vector quantization (VQ). Many VAE-based methods apply scalar quantization to the latent features and do not exploit the correlation between them. Although some methods incorporate VQ into learning-based compression, to the best of our knowledge there are no studies that combine a hyperprior-based VAE with VQ, because incorporating VQ into a hyperprior-based VAE makes likelihood estimation difficult. In this paper, we propose a new VAE-based image compression method that uses a VQ-based latent representation with a hyperprior-based context-adaptive entropy model to improve coding efficiency. The proposed method resolves the codebook-size bloat faced by conventional VQ-based methods by adopting lattice VQ as the underlying quantization method, and it achieves end-to-end optimization with the hyperprior-based context-adaptive entropy model by accurately approximating the likelihood of latent feature vectors using Monte Carlo integration. Furthermore, in likelihood estimation, we model each latent feature vector with a multivariate normal distribution that includes covariance parameters, which improves both the likelihood estimation accuracy and the RD performance. Experimental results show that the proposed method achieves state-of-the-art RD performance, exceeding existing learning-based methods and the latest video coding standard H.266/VVC by 18.0% on Kodak, 21.9% on CLIC2022, and 39.2% on Tecnick.
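The Monte Carlo likelihood approximation mentioned above can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a unit-hypercube quantization cell rather than a lattice Voronoi cell, and the function name and parameters are hypothetical. It estimates, by Monte Carlo integration, the probability mass that a multivariate normal with a full covariance matrix assigns to the cell of a quantized latent vector:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def mc_cell_likelihood(center, mean, cov, n_samples=100_000):
    """Monte Carlo estimate of the probability mass that N(mean, cov)
    assigns to the unit-hypercube cell around `center`:

        P = integral over cell of N(y; mean, cov) dy
          ~ vol(cell) * average density at uniform samples in the cell.
    """
    d = len(center)
    # Uniform samples inside the cell [center - 0.5, center + 0.5]^d.
    y = center + rng.uniform(-0.5, 0.5, size=(n_samples, d))
    dens = multivariate_normal.pdf(y, mean=mean, cov=cov)
    return dens.mean()  # cell volume is 1, so the average is the estimate

# 2-D toy check: a correlated Gaussian, cell centered at the mean.
p = mc_cell_likelihood(np.zeros(2), np.zeros(2),
                       [[1.0, 0.6], [0.6, 1.0]])
```

Unlike the factorized scalar case, this estimate respects the off-diagonal covariance terms, which is the point of modeling each latent vector with a full multivariate normal.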

1. INTRODUCTION

Image compression technology has become more important than ever for efficient data transmission and storage, owing to the demand for high-quality content and the growing popularity of video services. Various conventional image compression technologies have been standardized so far (JPEG (Wallace, 1991; ITU, 1993), JPEG2000 (Taubman & Marcellin, 2002; ISO/IEC, 2004), WebP (Google), H.264/AVC (Marpe et al., 2006; ISO/IEC, 2003), H.265/HEVC (Sullivan et al., 2012; ISO/IEC, 2013), H.266/VVC (Bross et al., 2021; ISO/IEC, 2020), etc.). These technologies consist of a combination of transform, quantization, and entropy coding. The transform is a major component: JPEG, H.265/HEVC, and H.266/VVC use the DCT or DST, while JPEG2000 uses the wavelet transform, all of which are handcrafted linear transforms. Such hand-crafted designs are limited in their ability to capture features across a variety of images. In recent years, deep learning has made remarkable progress, and learning-based methods are being actively explored in the field of image compression. Most recent learning-based methods are based on transform coding (Goyal, 2001). Many of them use convolutional neural network (CNN)-based autoencoders, in which the encoder transforms the input image into a latent representation that is then quantized and entropy coded, while the decoder reconstructs the image. This approach achieves flexible nonlinear transforms with higher potential to map pixels into a compressible latent representation than the linear transforms used by classical image compression. These methods can be divided into two types according to the metric used for encoder optimization. One is the generative approach, which directly maximizes subjective image quality (Rippel & Bourdev, 2017; Santurkar et al., 2018; Agustsson et al., 2019; Mentzer et al., 2020; Kudo et al., 2021).
This approach aims to make the distribution of reconstructed images approach that of natural images by using generative adversarial networks. The other type maximizes an objective metric such as peak signal-to-noise ratio (PSNR), solving the rate-distortion (RD) optimization problem in the same way as the classical image compression methods described above. This paper focuses on the latter approach, as it is applicable to a wider range of applications.

The latter approach appears in various proposals. Toderici et al. (2016; 2017) introduced recurrent neural networks for feature extraction, and Johnston et al. (2017) enhanced these networks to improve coding performance. Cai & Zhang (2018); Cai et al. (2018) directly trained the quantization. These methods quantize the latent features as fixed-length codes. By contrast, variational autoencoder (VAE)-based methods formulate the optimization as minimizing the entropy of the quantized latent features together with the expected distortion of the reconstructed image with respect to the original. The first image compression methods using VAEs were proposed by Theis et al. (2017) and Ballé et al. (2017), who studied entropy models that approximate the actual distributions of the quantized latent features. To improve the accuracy of the entropy model, a hyperprior-based context-adaptive entropy model was proposed by Ballé et al. (2018); it has been the baseline in most subsequent research. Whereas the modeled distributions of the latent features are fixed in (Theis et al., 2017; Ballé et al., 2017), Ballé et al. (2018) approximated the entropy model as a zero-mean Gaussian distribution with a scale parameter for each latent feature to remove spatial redundancy, with the contexts encoded as side information. Based on this hyperprior-based context-adaptive entropy model, various methods have been proposed to estimate the entropy model with higher accuracy. The autoregressive context model is one of the techniques that has brought significant performance improvements. Minnen et al. (2018) and Lee et al. (2019) proposed to jointly utilize an autoregressive context model and mean-and-scale hyperpriors. Mentzer et al. (2018) and Chen et al. (2021) extended the autoregressive context model to exploit channel neighbors with a 3D masked convolution module. In (Minnen & Singh, 2020) and (Zhu et al., 2022b), a channel-wise autoregressive model was applied to reduce the computational complexity of the spatial autoregressive model, and He et al. (2022) improved it further by dividing the model unevenly along the channel dimension. To further improve the entropy model, Liu et al. (2020) and Cheng et al. (2020) proposed a Gaussian mixture model and developed a network architecture with an attention module. From another perspective, Hu et al. (2020) proposed coarse-to-fine hyperprior modeling, while Yang et al. (2020) improved performance by refining the inference process without changing the training process. Ho et al. (2021) and Xie et al. (2021) focused on improving the network architecture by adopting a normalizing flow module. Some of these methods have been reported to surpass the RD performance of H.266/VVC, the latest (non-learning-based) video coding standard, in terms of the MS-SSIM metric, but they remain only comparable to it in terms of PSNR.

Vector quantization (VQ) has also been incorporated into learning-based methods to raise their performance. Since VQ potentially offers better RD performance than scalar quantization (Gray & Neuhoff, 1998; Gray, 1984; Chou et al., 1989), various studies have examined it (Shin & Lu, 1991; Antonini et al., 1992; Tatsaki et al., 1995; Shnaider & Paplinski, 2001; Voinson et al., 2002; Salleh & Soraghan, 2007; Chiranjeevi & Jena, 2018; Nag, 2019). The challenge in applying VQ to learning-based compression is how to incorporate the likelihood estimation of latent feature vectors into the optimization process. van den Oord et al. (2017) proposed VQ-VAE, which avoids likelihood calculation by assuming a uniform prior over the latent features and designing the encoder/decoder and codebooks separately so that gradient-based optimization remains possible. Razavi et al. (2019) and Fauw et al. (2019) extended VQ-VAE to a hierarchical network structure. Williams et al. (2020) revised the quantization process, while Xue et al. (2019) combined the optimization with supervised learning. However, none of the above methods performs well, because the learnable parameters of the encoder/decoder and the codebook are designed separately to enable gradient-based optimization and/or the probability distribution is assumed to be uniform. Agustsson et al. (2017) attempted end-to-end optimization by performing VQ with a soft-to-hard annealing strategy. However, its training suffers from unstable convergence, because it approximates the prior distribution of the codebook with a histogram taken from the training process. Zhu et al. (2022a) proposed a cascaded vector quantization scheme with multiple codebooks to keep memory

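Lattice VQ, adopted in the abstract as the basis quantization method precisely to avoid codebook bloat, replaces an unstructured learned codebook with a structured lattice whose nearest codeword can be found in closed form. As a hedged illustration (the paper's specific lattice and scaling are not reproduced here), the classical Conway-Sloane nearest-point rule for the D_n lattice, the set of integer vectors with even coordinate sum, can be sketched as:

```python
import numpy as np

def quantize_Dn(x):
    """Nearest point of the D_n lattice {z in Z^n : sum(z) even}
    to a real vector x, via the Conway-Sloane fast algorithm:
    round per coordinate; if the parity is odd, re-round the
    coordinate with the largest rounding error the other way.
    """
    f = np.round(x)                       # nearest point of Z^n
    if int(f.sum()) % 2 == 0:             # already in D_n
        return f
    # Parity repair: flip the worst-rounded coordinate.
    k = int(np.argmax(np.abs(x - f)))
    f[k] += 1.0 if x[k] > f[k] else -1.0
    return f

y = quantize_Dn(np.array([0.6, 0.6, 0.1, -0.2]))  # a D_4 lattice point
```

No codebook is stored at all: the quantizer is O(n) per vector regardless of rate, which is the structural advantage over learned-codebook VQ-VAE-style methods.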

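For reference, the mean-and-scale Gaussian entropy model used by the hyperprior methods surveyed above scores an integer-quantized latent by the Gaussian probability mass over its quantization bin. A minimal scalar sketch (illustrative only; the function name is hypothetical and this is not any cited paper's code):

```python
import numpy as np
from scipy.stats import norm

def latent_rate_bits(y_hat, mean, scale):
    """Rate (in bits) of an integer-quantized latent under a
    mean-and-scale Gaussian entropy model:

        p(y_hat) = Phi((y_hat - mean + 0.5) / scale)
                 - Phi((y_hat - mean - 0.5) / scale)

    where mean and scale would be predicted from the hyperprior.
    """
    upper = norm.cdf((y_hat - mean + 0.5) / scale)
    lower = norm.cdf((y_hat - mean - 0.5) / scale)
    p = np.clip(upper - lower, 1e-12, 1.0)  # guard against log(0)
    return -np.log2(p)

# Latents near the predicted mean are cheap; outliers cost more bits.
cheap = latent_rate_bits(0.0, mean=0.0, scale=1.0)
costly = latent_rate_bits(3.0, mean=0.0, scale=1.0)
```

Because each scalar bin is scored independently, this model cannot capture correlation between latent dimensions, which is the gap the proposed vector-valued likelihood estimation targets.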