RATE-DISTORTION OPTIMIZED POST-TRAINING QUANTIZATION FOR LEARNED IMAGE COMPRESSION

Abstract

Quantizing a floating-point neural network into its fixed-point representation is crucial for Learned Image Compression (LIC) because it ensures decoding consistency for interoperability and reduces the space-time complexity of the implementation. Existing solutions often have to retrain the network for model quantization, which is time-consuming and impractical. This work suggests the use of Post-Training Quantization (PTQ) to directly process pretrained, off-the-shelf LIC models. We theoretically prove that minimizing the mean squared error (MSE) in PTQ is suboptimal for the compression task, and thus develop a novel Rate-Distortion (R-D) Optimized PTQ (RDO-PTQ) to best retain the compression performance. RDO-PTQ only needs to compress a few images (e.g., 10) to optimize the transformation of the weights, biases, and activations of the underlying LIC model from its native 32-bit floating-point (FP32) format to 8-bit fixed-point (INT8) precision for fixed-point inference onwards. Experiments reveal the outstanding efficiency of the proposed method on different LICs, which show coding performance closest to that of their floating-point counterparts. Moreover, our method is a lightweight, plug-and-play approach that requires no model retraining, which is attractive to practitioners.

1. INTRODUCTION

Compressed images are used extensively in networked applications for efficient information sharing, which has continuously driven the pursuit of better compression technologies over the past decades (Wallace, 1992; Sullivan et al., 2012; Bross et al., 2021). Built upon the advances of deep neural networks (DNNs), recent years have witnessed the explosive growth of image compression solutions (Ballé et al., 2018; Minnen et al., 2018; Chen et al., 2021; Cheng et al., 2020; Hu et al., 2021; Lu et al., 2022) with efficiency superior to the well-known rules-based JPEG (Wallace, 1992), HEVC Intra (BPG) (Sullivan et al., 2012), and even the Versatile Video Coding Intra Profile (VVC Intra) (Bross et al., 2021). Nevertheless, existing learned image compression (LIC) approaches typically adopt the floating-point format for data representation (e.g., weights, biases, activations), which not only incurs excessive space-time complexity but also causes platform inconsistency and decoding failures (He et al., 2022). To tackle these issues in practical applications, model quantization is usually applied to generate fixed-point (or integer) LICs (Ballé et al., 2018; Hong et al., 2020; Sun et al., 2021). Popular Quantization-Aware Training (QAT) (Bhalgat et al., 2020; Le et al., 2022; Sun et al., 2021) was mainly used in (Ballé et al., 2018; Hong et al., 2020; Sun et al., 2020; 2021) to transform a floating-point LIC into its fixed-point representation. Such methods require model retraining with full access to the labels, which is expensive and impractical. Recently, Post-Training Quantization (PTQ) (Nagel et al., 2020; 2021) has offered a lightweight, plug-and-play solution to directly quantize pretrained, off-the-shelf network models without model retraining. However, such PTQ schemes were mostly dedicated to high-level vision tasks, as studied in (Choukroun et al., 2019; Liu et al., 2021). This work therefore extends the use of PTQ to image compression model quantization.
Considering the optimization complexity, RDO-PTQ is executed from one network layer¹ to another (i.e., layer by layer) to process the weights, biases, and activations in either convolutional or self-attention computation and determine their proper ranges for quantization. In the current implementation, it only compresses a tiny calibration image set (e.g., fewer than 10 images) to optimize the relevant quantization factors, such as range and offset, for a fixed-point model. Given that the distributions of both weights and activations vary across channels at each network layer, the range is adapted channel-wise in addition to the layer-wise adaptation (see Fig. 3). Besides determining the range of the bias, bias rescaling is applied to ensure that the computation strictly uses INT8 data tensors. Contribution. 1) We suggest the use of PTQ to quantize LIC models as a lightweight, plug-and-play solution that compresses just a few image samples to derive the fixed-point model without any model retraining; 2) Both rate and distortion metrics are jointly optimized at the inference stage of the compression task to determine proper ranges of weights, biases, and activations for quantization in the proposed RDO-PTQ; 3) Our method generalizes to a variety of LIC models, demonstrating the closest compression efficiency between the native FP32 and the corresponding quantized INT8 models.
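To make the channel-wise range adaptation and bias rescaling concrete, the following is a minimal NumPy sketch of how such a quantizer could look. It is an illustration under our own assumptions (symmetric per-channel scales, a conv weight layout of (out_ch, in_ch, kH, kW), and the hypothetical helper names `quantize_per_channel` and `rescale_bias`), not the paper's exact implementation:

```python
import numpy as np

def quantize_per_channel(w, num_bits=8):
    """Per-channel symmetric quantization of a conv weight tensor shaped
    (out_ch, in_ch, kH, kW). One scale per output channel, since weight
    ranges vary notably across channels."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    w_absmax = np.abs(w).max(axis=(1, 2, 3), keepdims=True)
    scale = np.maximum(w_absmax, 1e-12) / qmax          # channel-wise range
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale

def rescale_bias(b, w_scale, x_scale):
    """Bias rescaling: fold the weight and activation scales into the bias,
    b_int = round(b / (s_w * s_x)), so the layer can accumulate in integer
    arithmetic on INT8 tensors."""
    return np.round(b / (w_scale.reshape(-1) * x_scale)).astype(np.int32)
```

The dequantized weight `w_int * scale` then differs from `w` by at most half a quantization step per channel, which is the granularity the R-D optimization tunes.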

2. RELATED WORK

Learned Image Compression (LIC). As shown in Fig. 1, popular LICs are mainly built upon the Variational Auto-Encoder (VAE) architecture to find a rate-distortion optimized compact representation of the input image. In Ballé et al. (2018), on top of the GDN (Generalized Divisive Normalization) based nonlinear transform, a hyperprior modeled by a factorized distribution was introduced to better capture the distribution of latent features. Shortly after, the joint use of the hyperprior and autoregressive neighbors for entropy context modeling was developed in Minnen et al. (2018), demonstrating better efficiency than BPG (an HEVC Intra implementation). Later, stacked convolutions with simple ReLU were used in (Cheng et al., 2020; Chen et al., 2021) to replace GDN, and the attention mechanism was added for better information embedding, which, for the first time, outperformed VVC Intra. Recalling that the principle behind image coding is to find content-dependent models (e.g., transform, statistical distribution) for a more compact representation, solutions that simply stack convolutions are apparently incapable of efficiently characterizing the content dynamics because of the fixed receptive field and fixed network parameters of a trained convolutional neural network (CNN). To enable content-adaptive dynamic embedding, the self-attention mechanism was extended in (Qian et al., 2021; Lu et al., 2022; Lu & Ma, 2022). As extensively studied in (Lu et al., 2022; Lu & Ma, 2022), an integrated convolution



¹ For simplicity, we refer to the "network layer" as the "layer".



Figure 1: Learned Image Compression (LIC). g_a (g_s) is the main encoder (decoder); AE/AD is arithmetic encoding/decoding using p_Ŷ(Ŷ). Either convolution or self-attention is used to derive Ŷ from the input X. Model quantization Q is applied at every layer (convolutional or self-attention) to transform weight w, bias b, and activation x from native FP32 to INT8 precision.
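Since Q at every layer is driven by the rate-distortion objective rather than by weight MSE, one way to picture the calibration is a search over candidate clipping ranges scored by R + λD on the calibration images. The sketch below is purely illustrative: the grid of candidates and the hooks `distortion` (reconstruction distortion with the fake-quantized weights) and `rate` (estimated bitrate) are our assumptions, not the paper's interface:

```python
import numpy as np

def rd_search_scale(w, distortion, rate, lam,
                    candidates=np.linspace(0.8, 1.2, 9)):
    """Grid-search the clipping range of one layer's weights, scoring each
    candidate by the rate-distortion loss R + lam * D measured through the
    compression model, instead of the weight-space MSE."""
    qmax = 127                                   # INT8 symmetric range
    absmax = np.abs(w).max()
    best_scale, best_loss = None, np.inf
    for c in candidates:
        scale = c * absmax / qmax                # candidate quantization step
        w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        loss = rate(w_q) + lam * distortion(w_q)  # R + lambda * D
        if loss < best_loss:
            best_scale, best_loss = scale, loss
    return best_scale
```

With `rate` and `distortion` evaluated through the actual encoder/decoder on a handful of images, this kind of search selects ranges that preserve coding performance rather than raw parameter fidelity.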

