LVQ-VAE: END-TO-END HYPERPRIOR-BASED VARIATIONAL IMAGE COMPRESSION WITH LATTICE VECTOR QUANTIZATION

Abstract

Image compression has become an increasingly important research topic. In recent years, learning-based methods have been studied extensively, and variational autoencoder (VAE)-based methods using a hyperprior-based context-adaptive entropy model have been reported to be comparable to the latest video coding standard, H.266/VVC, in terms of rate-distortion (RD) performance. We believe there is room for improvement in the quantization of latent features by adopting vector quantization (VQ). Many VAE-based methods use scalar quantization for latent features and therefore do not exploit the correlation between features. Although some methods incorporate VQ into learning-based compression, to the best of our knowledge no study has combined a hyperprior-based VAE with VQ, because incorporating VQ into a hyperprior-based VAE makes likelihood estimation difficult. In this paper, we propose a new VAE-based image compression method that uses a VQ-based latent representation with a hyperprior-based context-adaptive entropy model to improve coding efficiency. The proposed method resolves the codebook-size bloat faced by conventional VQ-based methods by adopting lattice VQ as the underlying quantization method, and it achieves end-to-end optimization with the hyperprior-based context-adaptive entropy model by approximating the likelihood of latent feature vectors with high accuracy using Monte Carlo integration. Furthermore, in likelihood estimation, we model each latent feature vector with a multivariate normal distribution that includes covariance matrix parameters, which improves both the likelihood estimation accuracy and the RD performance. Experimental results show that the proposed method achieves state-of-the-art RD performance, exceeding existing learning-based methods and the latest video coding standard H.266/VVC by 18.0% on Kodak, 21.9% on CLIC2022, and 39.2% on Tecnick.
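As a concrete illustration (not the paper's implementation), lattice VQ replaces per-element rounding with rounding to the nearest point of a structured lattice, so no learned codebook needs to be stored, and the likelihood of a quantized vector is the probability mass of its Voronoi cell, which can be estimated by Monte Carlo integration. The sketch below uses the D_n lattice (integer vectors with an even coordinate sum) and an assumed factorized Gaussian density standing in for the entropy model; the lattice choice, Gaussian parameters, and sample count are all illustrative assumptions.

```python
import math
import random

def quantize_Dn(x):
    """Round x to the nearest point of the D_n lattice
    (integer vectors whose coordinates sum to an even number)."""
    f = [round(v) for v in x]
    if sum(f) % 2 != 0:
        # Fix the parity by re-rounding the coordinate with the largest
        # rounding error in the other direction (Conway & Sloane's rule).
        i = max(range(len(x)), key=lambda j: abs(x[j] - f[j]))
        f[i] += 1 if x[i] > f[i] else -1
    return f

def gaussian_pdf(y, mu, sigma):
    """Factorized Gaussian density (a stand-in for the hyperprior output)."""
    p = 1.0
    for v, m, s in zip(y, mu, sigma):
        p *= math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return p

def cell_probability(y_hat, mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(y_hat): integrate the density over the
    Voronoi cell of y_hat by sampling uniformly in a bounding box and
    keeping only samples that quantize back to y_hat."""
    rng = random.Random(seed)
    dim = len(y_hat)
    box_volume = 2.0 ** dim  # the box [y_hat - 1, y_hat + 1]^dim covers the cell
    acc = 0.0
    for _ in range(n_samples):
        y = [c + rng.uniform(-1.0, 1.0) for c in y_hat]
        if quantize_Dn(y) == y_hat:  # sample lies inside the Voronoi cell
            acc += gaussian_pdf(y, mu, sigma)
    return box_volume * acc / n_samples

y = [0.9, 0.2, 0.1, -0.3]
y_hat = quantize_Dn(y)  # a D_4 lattice point
p = cell_probability(y_hat, mu=[0.0] * 4, sigma=[1.0] * 4)
```

The estimated cell probability is exactly the quantity an arithmetic coder needs; the paper's contribution, as described above, is making this estimate accurate and differentiable enough for end-to-end training with the hyperprior.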

1. INTRODUCTION

Image compression technology has become more important than ever for efficient data transmission and storage, driven by the demand for high-quality content and the growing popularity of video services. Various conventional image compression technologies have been standardized so far, including JPEG (Wallace, 1991; ITU, 1993), JPEG2000 (Taubman & Marcellin, 2002; ISO/IEC, 2004), WebP (Google), H.264/AVC (Marpe et al., 2006; ISO/IEC, 2003), H.265/HEVC (Sullivan et al., 2012; ISO/IEC, 2013), and H.266/VVC (Bross et al., 2021; ISO/IEC, 2020). These technologies consist of a combination of transform, quantization, and entropy coding. The transform is a major component: JPEG, H.265/HEVC, and H.266/VVC use the DCT or DST, while JPEG2000 uses the wavelet transform, all of which are hand-crafted linear transforms. Such hand-crafted designs are limited in their ability to capture the features of a wide variety of images. In recent years, deep learning has made remarkable progress, and learning-based methods are being actively explored in the field of image compression. Most recent learning-based methods are based on transform coding (Goyal, 2001). Many of these methods use convolutional neural network (CNN)-based autoencoders, in which the encoder transforms the input image into a latent representation that is then quantized and entropy coded, while the decoder reconstructs the restored image. This approach achieves flexible nonlinear transforms that have higher potential to
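The transform-quantize-entropy-code pipeline described above can be sketched with a toy linear example. Here a 2-tap Haar transform and uniform scalar quantization stand in for the learned analysis/synthesis transforms of a CNN autoencoder; the function names, the signal, and the quantization step size are illustrative assumptions, not the paper's method.

```python
import math

S = math.sqrt(2)

def analysis(x):
    """Orthogonal 2-tap Haar transform: pairwise sums then differences."""
    return [(x[i] + x[i + 1]) / S for i in range(0, len(x), 2)] + \
           [(x[i] - x[i + 1]) / S for i in range(0, len(x), 2)]

def synthesis(y):
    """Inverse Haar transform: recombine each (sum, difference) pair."""
    n = len(y) // 2
    out = []
    for a, d in zip(y[:n], y[n:]):
        out += [(a + d) / S, (a - d) / S]
    return out

def quantize(y, step):
    """Uniform scalar quantization: the lossy step of the pipeline."""
    return [round(v / step) for v in y]

def dequantize(q, step):
    return [v * step for v in q]

x = [10.0, 10.5, 20.0, 19.0]       # toy "image" signal
q = quantize(analysis(x), 1.0)     # integer symbols for the entropy coder
x_hat = synthesis(dequantize(q, 1.0))  # lossy reconstruction at the decoder
```

Without quantization the transform is perfectly invertible; all distortion comes from rounding, and all compression comes from entropy coding the integer symbols. Learning-based methods replace the fixed linear transform with a trained nonlinear one while keeping this overall structure.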

