EVC: TOWARDS REAL-TIME NEURAL IMAGE COMPRESSION WITH MASK DECAY

Abstract

Neural image compression has surpassed state-of-the-art traditional codecs (H.266/VVC) in rate-distortion (RD) performance, but suffers from large complexity and from requiring separate models for different rate-distortion trade-offs. In this paper, we propose an Efficient single-model Variable-bit-rate Codec (EVC), which runs at 30 FPS on 768 × 512 input images and still outperforms VVC in RD performance. By further reducing both encoder and decoder complexities, our small model even achieves 30 FPS on 1920 × 1080 input images. To bridge the performance gap between models of different capacities, we carefully design mask decay, which automatically transforms the large model's parameters into those of the small model. We also propose a novel sparsity regularization loss to mitigate the shortcomings of L_p regularization. Our algorithm significantly narrows the performance gap by 50% and 30% for our medium and small models, respectively. Finally, we advocate a scalable encoder for neural image compression, whose encoding complexity can be adjusted dynamically to meet different latency requirements. We propose decaying the large encoder multiple times to reduce the residual representation progressively. Both mask decay and residual representation learning greatly improve the RD performance of our scalable encoder. Our code is at https://github.com/microsoft/DCVC.

1. INTRODUCTION

Image compression based on deep learning has achieved extraordinary rate-distortion (RD) performance compared to traditional codecs such as H.266/VVC (Bross et al., 2021). However, two main issues limit its practicality in real-world applications. The first is the large complexity. Most state-of-the-art (SOTA) neural image codecs rely on complex models, such as large-capacity backbones (Zhu et al., 2022; Zou et al., 2022), sophisticated probability models (Cheng et al., 2020), and the parallelization-unfriendly auto-regressive model (Minnen et al., 2018). The large complexity easily results in unsatisfactory latency in real-world applications. The second issue is inefficient rate control. Multiple models need to be trained and stored for different rate-distortion trade-offs, and inference requires loading a specific model according to the target quality. For practical purposes, a single model that handles variable RD trade-offs is desired.

In this paper, we address both issues to build a single real-time neural image compression model whose RD performance remains on par with that of other SOTA models. Inspired by recent progress, we design an efficient framework that is equipped with Depth-Conv blocks (Liu et al., 2022) and a spatial prior (He et al., 2021; Li et al., 2022a; Qian et al., 2022). For variable RD trade-offs, we introduce an adjustable quantization step (Chen & Ma, 2020; Li et al., 2022a) for the representations (a sketch of this mechanism is given at the end of this section). All modules within our framework are highly efficient and GPU-friendly, unlike recent Transformer-based models (Zou et al., 2022). Encoding and decoding (including the arithmetic coding) run at 30 FPS for 768 × 512 inputs. Compared with other SOTA models, ours enjoys comparable RD performance, low latency, and a single model for all RD trade-offs.

To further accelerate our model, we reduce the complexity of both the encoder and the decoder. Three model sizes are proposed: Large, Medium, and Small (cf. Appendix Tab. 3). Note that our small model even achieves 30 FPS for 1920 × 1080 inputs. Recent work (Zhu et al., 2022) points out that simply reducing the model's capacity results in significant performance loss, yet this crucial problem has not been solved effectively. In this paper, we advocate training small neural compression models with a teacher to mitigate this problem. In particular, mask decay is proposed to transform the large model's parameters into the small model automatically. Specifically, mask layers are first inserted into a pretrained teacher model. Then we sparsify these masks until all layers reach the student's structure. Finally, the masks are merged into the raw layers while keeping the network functionality unchanged. For the sparsity regularization, L_p-norm based losses are adopted by most previous works (Li et al., 2016; Ding et al., 2018; Zhang et al., 2021), but they hardly work for neural image compression (cf. Fig. 7). We therefore propose a novel sparsity regularization loss that alleviates the drawbacks of L_p and optimizes our masks more effectively. With the help of our large model, our medium and small models are improved significantly by 50% and 30%, respectively (cf. Fig. 1a). This demonstrates the effectiveness of our mask decay and shows that reusing a large model's parameters is helpful for training a small neural image compression model.
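To make the mask decay procedure concrete, below is a minimal PyTorch sketch of the three steps just described (insert masks, sparsify, merge). It is illustrative only: the wrapper class `MaskedConv`, the per-output-channel mask granularity, and the plain L1 penalty used as the sparsity term are our assumptions for exposition; the paper argues that L_p losses work poorly here and replaces them with its own regularizer, and the actual implementation is in the released code.

```python
import torch
import torch.nn as nn

class MaskedConv(nn.Module):
    """A pretrained conv layer wrapped with a per-output-channel mask.

    Sketch of mask decay: the mask starts at 1 (teacher behavior is
    unchanged), is sparsified during fine-tuning, and is finally
    merged back into the conv weights.
    """

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv  # pretrained teacher layer
        # One mask scalar per output channel, initialized to 1.
        self.mask = nn.Parameter(torch.ones(conv.out_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        return y * self.mask.view(1, -1, 1, 1)

    def sparsity_loss(self) -> torch.Tensor:
        # Plain L1 penalty as a stand-in for the paper's novel
        # sparsity regularization loss.
        return self.mask.abs().sum()

    @torch.no_grad()
    def merge(self) -> nn.Conv2d:
        """Fold the mask into the weights, keeping the network
        functionality unchanged; zeroed channels can then be pruned,
        yielding the student layer."""
        self.conv.weight.mul_(self.mask.view(-1, 1, 1, 1))
        if self.conv.bias is not None:
            self.conv.bias.mul_(self.mask)
        return self.conv
```

During fine-tuning, the usual rate-distortion loss would be augmented with a weighted sum of the `sparsity_loss()` terms; once the surviving channels match the student architecture, `merge()` recovers a plain convolution with identical outputs.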
In addition, considering the various device capabilities in real-world codec applications, encoder scalability, i.e., supporting different encoding complexities with only one decoder, is also a critical need. To achieve this, we propose compressing the cumbersome encoder multiple times to progressively bridge the performance gap. Both residual representation learning (RRL) and mask decay implicitly treat the cumbersome encoder as a reference and encourage diversity among the different small encoders. As a result, ours achieves better performance than training separate encoders (cf. Fig. 1b). Compared to SlimCAE (Yang et al., 2021), which is inspired by slimmable networks (Yu et al., 2019), ours enjoys a simpler framework and better RD performance.

Our contributions are as follows.

• We propose an Efficient Variable-bit-rate Codec (EVC) for image compression. It needs only one model for different RD trade-offs. Our model runs at 30 FPS for 768 × 512 inputs while being on par with other SOTA models in RD performance. Our small model even achieves 30 FPS for 1920 × 1080 inputs.

• We propose mask decay, an effective method to improve a student image compression model with the help of a teacher. A novel sparsity regularization loss is also introduced, which alleviates the shortcomings of L_p regularization. Thanks to mask decay, our medium and small models are significantly improved by 50% and 30%, respectively.

• We enable encoding scalability for neural image compression. With residual representation learning and mask decay, our scalable encoder significantly narrows the performance gap to the teacher and achieves better RD performance than the previous SlimCAE.
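As promised above, here is a minimal sketch of the single-model variable-rate mechanism via an adjustable quantization step, in the spirit of Chen & Ma (2020) and Li et al. (2022a). The learnable per-quality-level steps `q_steps`, their initialization, and the straight-through rounding estimator are our assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VariableRateQuantizer(nn.Module):
    """Quantize the latent y with a learnable, selectable step size.

    One step per quality level; a single codec model then covers
    multiple RD trade-offs by switching the step, instead of training
    a separate model per trade-off. Purely a sketch of the cited idea.
    """

    def __init__(self, num_levels: int = 4):
        super().__init__()
        # Larger steps -> coarser quantization -> lower bitrate.
        self.q_steps = nn.Parameter(torch.linspace(0.5, 2.0, num_levels))

    def forward(self, y: torch.Tensor, level: int) -> torch.Tensor:
        q = self.q_steps[level].clamp(min=1e-6)
        y_scaled = y / q
        # Straight-through estimator: round in the forward pass,
        # identity gradient in the backward pass.
        y_hat = y_scaled + (torch.round(y_scaled) - y_scaled).detach()
        return y_hat * q  # dequantized latent fed to the decoder
```

The entropy model estimates the rate of the rounded, scaled latent, so a larger step lowers the bitrate while the shared analysis and synthesis transforms stay fixed; interpolating between learned steps could further give continuous rate control.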

2. RELATED WORKS

Neural image compression is flourishing. Ballé et al. (2017) proposed replacing the quantizer with additive i.i.d. uniform noise during training, so that the neural image compression model can be trained end-to-end. The hyperprior structure was then proposed by Ballé et al. (2018).
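For readers unfamiliar with this trick, the sketch below shows the standard training-time quantization proxy from Ballé et al. (2017): hard rounding is used at inference, while additive uniform noise keeps training differentiable. The function name and signature are ours.

```python
import torch

def quantize(y: torch.Tensor, training: bool) -> torch.Tensor:
    """Quantization proxy from Ballé et al. (2017).

    Training: add i.i.d. noise from U(-0.5, 0.5), a differentiable
    stand-in whose density matches that of rounding error.
    Inference: actual rounding, so the latents can be entropy-coded.
    """
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5)
        return y + noise
    return torch.round(y)
```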



Figure 1: The trade-off between BD-Rate and complexity on Kodak. The anchor is VTM. (a) and (b) show the performance improvements from our mask decay and residual representation learning (RRL). We cite results from SwinT-Hyperprior (Zhu et al., 2022) for comparison.

