EVC: TOWARDS REAL-TIME NEURAL IMAGE COMPRESSION WITH MASK DECAY

Abstract

Neural image compression has surpassed state-of-the-art traditional codecs (H.266/VVC) in rate-distortion (RD) performance, but it suffers from large complexity and requires separate models for different rate-distortion trade-offs. In this paper, we propose an Efficient single-model Variable-bit-rate Codec (EVC), which runs at 30 FPS on 768x512 input images while still outperforming VVC in RD performance. By further reducing both encoder and decoder complexity, our small model even achieves 30 FPS on 1920x1080 input images. To bridge the performance gap between models of different capacities, we meticulously design mask decay, which automatically transforms the large model's parameters into the small model. In addition, we propose a novel sparsity regularization loss to mitigate the shortcomings of L_p regularization. Our algorithm narrows the performance gap by 50% and 30% for our medium and small models, respectively. Finally, we advocate a scalable encoder for neural image compression, whose encoding complexity is adjustable to meet different latency requirements. We propose decaying the large encoder multiple times to reduce the residual representation progressively. Both mask decay and residual representation learning greatly improve the RD performance of our scalable encoder. Our code is at https://github.com/microsoft/DCVC.

1. INTRODUCTION

Image compression based on deep learning has achieved extraordinary rate-distortion (RD) performance compared to traditional codecs (H.266/VVC) (Bross et al., 2021). However, two main issues limit its practicability in real-world applications. The first is large complexity. Most state-of-the-art (SOTA) neural image codecs rely on complex models, such as large-capacity backbones (Zhu et al., 2022; Zou et al., 2022), sophisticated probability models (Cheng et al., 2020), and parallelization-unfriendly auto-regressive models (Minnen et al., 2018). This large complexity often results in unacceptable latency in real-world applications. The second issue is inefficient rate control. Multiple models need to be trained and stored for different rate-distortion trade-offs, and inference requires loading a specific model according to the target quality. For practical purposes, a single model is desired that handles variable RD trade-offs.

In this paper, we address both issues to build a single real-time neural image compression model whose RD performance remains on par with that of other SOTA models. Inspired by recent progress, we design an efficient framework that is equipped with Depth-Conv blocks (Liu et al., 2022) and the spatial prior (He et al., 2021; Li et al., 2022a; Qian et al., 2022). For variable RD trade-offs, we introduce an adjustable quantization step (Chen & Ma, 2020; Li et al., 2022a) for the representations. All modules within our framework are highly efficient and GPU friendly, unlike recent Transformer-based models (Zou et al., 2022). Encoding and decoding (including the arithmetic coding) run at 30 FPS for 768 × 512 inputs. Compared with other SOTA models, ours enjoys comparable RD performance, low latency, and a single model for all RD trade-offs.

To further accelerate our model, we reduce the complexity of both the encoder and the decoder. We propose three models of different sizes: Large, Medium, and Small (cf. Appendix Tab. 3). Note that our small
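To illustrate the variable-rate mechanism mentioned above, the following is a minimal sketch of an adjustable quantization step: a single model can cover multiple RD trade-offs by scaling the latent with a quality-dependent step before rounding. All names and step values here are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def quantize(y, q):
    """Quantize latent y with step q; a larger q yields a coarser
    latent (lower rate, higher distortion)."""
    return np.round(y / q) * q

# One quantization step per quality level (values are made up
# for illustration; in practice they would be learned).
q_steps = {"high_rate": 0.5, "mid_rate": 1.0, "low_rate": 2.0}

y = np.array([0.3, -1.7, 2.2, 0.9])
for name, q in q_steps.items():
    y_hat = quantize(y, q)
    # The reconstruction error of the latent grows as q increases.
    print(name, y_hat, np.abs(y - y_hat).max())
```

Because only the scalar step changes between quality levels, the same network weights serve every RD trade-off, avoiding separate per-rate models.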

