MIMT: MASKED IMAGE MODELING TRANSFORMER FOR VIDEO COMPRESSION

Abstract

Deep learning video compression outperforms its hand-crafted counterparts with enhanced flexibility and capacity. One key component of a learned video codec is the autoregressive entropy model conditioned on spatial and temporal priors. Operating autoregressively in raster-scan order naively treats the context as unidirectional, which is neither efficient nor optimal, since conditioning information may be located at the end of the sequence. We thus introduce an entropy model based on a masked image modeling transformer (MIMT) to learn spatial-temporal dependencies. Video frames are first encoded into sequences of tokens and then processed by the transformer encoder as priors. The transformer decoder learns the probability mass functions (PMFs) conditioned on the priors and the masked inputs, and is therefore able to select optimal decoding orders without a fixed direction. During training, MIMT learns to predict the PMFs of randomly masked tokens by attending to tokens in all directions. This allows MIMT to capture temporal dependencies from the encoded priors and spatial dependencies from the unmasked, i.e., already decoded, tokens. At inference time, the model first generates the PMFs of all masked tokens in parallel and then decodes the frame iteratively, conditioning each step on the previously decoded tokens (those selected with high confidence). In addition, we improve overall performance with further techniques, e.g., manifold conditional priors that accumulate long-range information, and shifted-window attention to reduce complexity. Extensive experiments demonstrate that the proposed MIMT framework, equipped with the new transformer entropy model, achieves state-of-the-art performance on the HEVC, UVG, and MCL-JCV datasets, generally outperforming VVC in terms of PSNR and SSIM.

1. INTRODUCTION

Video data continues to grow exponentially as demand for video applications increases across social media platforms and mobile devices. Traditional video codecs, such as HEVC and VVC, keep moving toward greater efficiency, hardware friendliness, and versatility. However, they still follow a hybrid coding framework that has remained unchanged for decades: spatial-temporal predictive coding plus transform-based residual coding.

Neural video compression has emerged to outperform hand-crafted codecs by optimizing a rate-distortion loss in an end-to-end manner. One line of earlier work replaces traditional coding modules, including motion estimation, optical-flow-based warping, and residual coding, with neural networks. Recently, residual coding has been shown to be suboptimal compared with context coding. Moreover, a pixel in frame x_t is related to all pixels in the previously decoded frames x_{<t} and to the pixels already decoded in x_t. Because this space is so large, traditional video codecs cannot explicitly exploit all of these correlations with handcrafted rules. Using an entropy model to exploit the spatial-temporal dependencies in the current and past decoded frames can vastly reduce data redundancy. The transformer is on the rise for computer vision tasks, including low-level image analysis. Inspired by the language-translation model, VCT
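As context for the entropy model discussed above, the confidence-based iterative decoding described in the abstract can be sketched as follows. This is a toy illustration rather than the paper's implementation: in MIMT the per-token confidences would be derived from the PMFs predicted by the transformer decoder, whereas here they are supplied directly, and the function name and the simple linear schedule are our own assumptions.

```python
import numpy as np

def iterative_decode_order(confidences, num_steps=4):
    """Toy confidence-based iterative decoding schedule (assumed names).

    `confidences` is an (N,) array standing in for the per-token
    confidences a MIMT-style decoder would derive from its predicted
    PMFs. All tokens start masked; each step "decodes" the
    highest-confidence tokens that are still masked. Returns the step
    index at which each token is decoded.
    """
    n = len(confidences)
    decoded_at = np.full(n, -1, dtype=int)
    masked = np.ones(n, dtype=bool)
    per_step = int(np.ceil(n / num_steps))  # simple linear schedule (assumption)
    for step in range(num_steps):
        if not masked.any():
            break
        # Rank still-masked tokens by confidence; decoded ones get -inf.
        ranked = np.argsort(np.where(masked, confidences, -np.inf))[::-1]
        chosen = ranked[:min(per_step, int(masked.sum()))]
        decoded_at[chosen] = step
        masked[chosen] = False
    return decoded_at
```

Each iteration fixes the most confident remaining tokens and conditions the next round of PMF predictions on them, which is what lets the model avoid a fixed raster-scan direction.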

