MIMT: MASKED IMAGE MODELING TRANSFORMER FOR VIDEO COMPRESSION

Abstract

Deep learning video compression outperforms its hand-crafted counterparts with enhanced flexibility and capacity. One key component of a learned video codec is the autoregressive entropy model conditioned on spatial and temporal priors. Operating autoregressively in raster-scan order naively treats the context as unidirectional. This is neither efficient nor optimal, since informative conditional context may lie later in the sequence. We thus introduce an entropy model based on a masked image modeling transformer (MIMT) to learn spatial-temporal dependencies. Video frames are first encoded into sequences of tokens and then processed by the transformer encoder as priors. The transformer decoder learns the probability mass functions (PMFs) conditioned on the priors and masked inputs, and is therefore able to select decoding orders freely rather than following a fixed direction. During training, MIMT predicts the PMFs of randomly masked tokens by attending to tokens in all directions. This allows MIMT to capture temporal dependencies from the encoded priors and spatial dependencies from the unmasked, i.e., already decoded, tokens. At inference time, the model first generates PMFs for all masked tokens in parallel and then decodes the frame iteratively, conditioning on the previously decoded tokens (i.e., those with high confidence). In addition, we improve overall performance with further techniques, e.g., manifold conditional priors that accumulate a long range of information, and shifted-window attention to reduce complexity. Extensive experiments demonstrate that the proposed MIMT framework equipped with the new transformer entropy model achieves state-of-the-art performance on the HEVC, UVG, and MCL-JCV datasets, generally outperforming VVC in terms of PSNR and SSIM.

1. INTRODUCTION

The volume of video continues to grow exponentially as demand for video applications increases on social media platforms and mobile devices. Traditional video codecs, such as HEVC and VVC, are still moving toward more efficient, hardware-friendly, and versatile designs. However, they still follow a hybrid coding framework that has remained unchanged for decades: spatial-temporal predictive coding plus transform-based residual coding. Neural video compression has surged to outperform hand-crafted codecs by optimizing the rate-distortion loss in an end-to-end manner. One line of earlier work replaces traditional coding modules, including motion estimation, optical-flow-based warping, and residual coding, with neural networks. Recently, residual coding has been shown to be suboptimal compared with context coding. Moreover, a pixel in frame x_t is related to all pixels in the previously decoded frames x_{<t} and to the pixels already decoded in x_t. Because this space is huge, traditional video codecs cannot explore all such correlations explicitly with handcrafted rules. Using an entropy model to exploit the spatial-temporal dependencies in the current and past decoded frames can vastly reduce data redundancy. The transformer is on the rise for computer vision tasks, including low-level image analysis. Inspired by language-translation models, VCT (Mentzer et al., 2022) was the first to use a transformer as the conditional entropy model to predict the probability mass function (PMF) from the previous frames. VCT uses the estimated probability to losslessly compress the quantized latent feature map ŷ_t without explicit warping or residual coding modules. The better the transformer predicts the PMFs, the fewer bits are required for the video frames.
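The link between PMF prediction and bitrate can be made concrete: an ideal entropy coder spends about -log2 p(token) bits per symbol, so the cross-entropy of the predicted PMFs against the actual tokens is (up to coder overhead) the frame's bit cost. The following minimal numpy sketch illustrates this; the function name and toy values are our own, not from the paper:

```python
import numpy as np

def bit_cost(tokens, pmf):
    """Ideal code length in bits for `tokens` under the predicted `pmf`.

    tokens: int array of shape (N,), values in [0, V)
    pmf:    array of shape (N, V), each row a probability mass function
    """
    p = pmf[np.arange(len(tokens)), tokens]
    return float(-np.log2(p).sum())

# Toy example: a sharper (more confident) PMF costs fewer bits
# than a flat one over the same vocabulary.
tokens = np.array([2, 0, 1])
sharp = np.array([[0.05, 0.05, 0.9],
                  [0.9, 0.05, 0.05],
                  [0.05, 0.9, 0.05]])
flat = np.full((3, 3), 1.0 / 3.0)
assert bit_cost(tokens, sharp) < bit_cost(tokens, flat)
```

A uniform PMF over 3 symbols costs log2(3) bits per token, while the confident PMF above costs only -log2(0.9) per token, which is why better PMF prediction directly saves bits.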
For VCT, the transformer decoder is an autoregressive model that naively regards video frames as sequences of tokens and decodes the current frame y_t sequentially in raster-scan order (i.e., token by token). We find this strategy neither optimal nor efficient, and thus we propose a masked image modeling transformer (MIMT) with bidirectional attention.

Masked Image Modeling. Masked language modeling, first proposed in BERT (Devlin et al., 2018), has revolutionized the field of natural language processing, especially when scaled to large datasets and huge models (Brown et al., 2020). Its success in NLP has been replicated in vision tasks by masking patches of pixels (He et al., 2022) or masking tokens generated by a pretrained dVAE (Bao et al., 2021; Xie et al., 2022). Recently, these ideas have been extended to other domains to learn good representations for action recognition (Tong et al., 2022; Feichtenhofer et al., 2022), video prediction (Gupta et al., 2022), and image generation (Chang et al., 2022; Wu et al., 2022).
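The BERT-style masking step these works share can be sketched in a few lines: sample a set of positions in the flattened token grid and replace them with a mask token whose PMFs the model must predict. The sentinel `MASK_ID`, the mask ratio, and the grid size below are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel standing in for a learned [MASK] token

def random_mask(tokens, ratio, rng):
    """BERT-style masking of a flattened token grid.

    Returns (masked_tokens, mask), where `mask` marks the positions
    whose PMFs the transformer is trained to predict.
    """
    n = tokens.size
    k = max(1, int(round(ratio * n)))           # number of tokens to mask
    idx = rng.choice(n, size=k, replace=False)  # positions chosen uniformly
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = MASK_ID
    return masked, mask

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=16 * 16)    # a toy 16x16 latent token grid
masked, mask = random_mask(tokens, ratio=0.5, rng=rng)
assert mask.sum() == 128                        # half the grid is masked
assert np.all(masked[mask] == MASK_ID)          # masked positions hidden
assert np.all(masked[~mask] == tokens[~mask])   # the rest untouched
```

Because attention is bidirectional, the model sees every unmasked token when predicting each masked one, regardless of raster position.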

3. METHOD

As shown in Fig. 1, we encode a sequence of (RGB) video frames {x_t}_{t=1}^{T} into latent tokens {y_t}_{t=1}^{T} using a CNN-based image encoder. Next, we take the temporal sequence {ŷ_{t-1}, . . . , ŷ_1} from the decoded-frames buffer. We use this decoded sequence to compress y_t with the transformer
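The decoded-frames buffer above can be modeled as a bounded queue that evicts the oldest frame and yields its contents most-recent-first, matching the prior sequence {ŷ_{t-1}, . . . , ŷ_1}. This is only a structural sketch under our own assumptions; the buffer depth and the class/method names are invented for illustration:

```python
from collections import deque

class DecodedFrameBuffer:
    """Holds the latents of the last `maxlen` decoded frames and returns
    them most-recent-first, i.e., the prior order {y_hat[t-1], ...}.
    `maxlen` and the interface are illustrative choices, not the paper's.
    """

    def __init__(self, maxlen=3):
        self._buf = deque(maxlen=maxlen)  # deque drops the oldest on overflow

    def push(self, y_hat):
        self._buf.append(y_hat)

    def priors(self):
        return list(reversed(self._buf))  # most recent frame first

buf = DecodedFrameBuffer(maxlen=2)
buf.push("y1")
buf.push("y2")
buf.push("y3")                        # "y1" is evicted by the bound
assert buf.priors() == ["y3", "y2"]
```

Bounding the buffer keeps the transformer-encoder input length fixed while still exposing several frames of temporal context.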



Lu et al. (2019) developed the DVC model, in which all modules of the traditional hybrid video codec are replaced by networks. DVC-Pro extends it with a more advanced entropy model and a deeper network (Lu et al., 2020). Agustsson et al. (2020) extended optical-flow-based estimation to a 3D transformation by adding a scale dimension. Hu et al. (2020) considered rate-distortion optimization when encoding motion vectors. In Lin et al. (2020), a single reference frame is extended to multiple reference frames. Yang et al. (2020) proposed an RNN-based residual encoder and decoder to exploit accumulated temporal information. Deviating from residual coding, DCVC (Li et al., 2021) employed contextual coding to compensate for the shortcomings of the residual coding scheme. Mentzer et al. (2022) proposed replacing the "hand-crafted" video compression pipeline of explicit motion estimation, warping, and residual coding with a transformer-based temporal model. Contemporary work by Li et al. (2022) uses multiple modules, e.g., learnable quantization and a parallel entropy model, to significantly improve compression performance, surpassing the latest VVC codec. AlphaVC (Shi et al., 2022) introduced several techniques, e.g., a conditional I-frame and pixel-to-feature motion prediction, to effectively improve rate-distortion performance.

During training, MIMT optimizes a proxy task similar to the mask prediction in BERT (Devlin et al., 2018) and BEiT (Bao et al., 2021): predicting the PMFs of masked tokens. At inference, MIMT adopts a novel non-sequential autoregressive decoding method to predict the image in a few steps. Each step keeps the most confident (smallest-entropy) tokens for the next iteration.

Our contributions are summarized as follows: (1) We design an entropy model based on a bidirectional transformer, MIMT, to compress the spatial-temporal redundancy in video frames. MIMT is trained on a masked image modeling task. At inference time, it captures temporal information from past frames and spatial information from the already decoded tokens. (2) We introduce further techniques to make our video compression model versatile. We employ manifold priors, including a recurrent latent prior, to accumulate information over an extended range of decoded frames. To further reduce MIMT's complexity, we introduce alternating transformer layers with non-overlapping shifted-window attention. (3) With all these improvements, the proposed MIMT achieves state-of-the-art compression results. It generally outperforms the latest H.266 (VTM) in terms of PSNR and SSIM, with a bitrate saving of 29.6% over H.266 (VTM) on the UVG dataset in terms of PSNR.
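The non-sequential decoding loop described above can be sketched as follows: at each step, predict PMFs for all still-masked positions in parallel, then entropy-decode only the lowest-entropy (most confident) subset and feed those tokens back as context. The stand-in functions `predict_pmfs` and `decode_token`, and the per-step schedule, are our assumptions; they abstract the MIMT transformer and the arithmetic coder:

```python
import numpy as np

def decode_frame(predict_pmfs, decode_token, n_tokens, steps=8):
    """Confidence-ordered iterative decoding (sketch).

    predict_pmfs(decoded) -> (n_tokens, V) PMFs given current partial frame
    decode_token(i, pmf)  -> token decoded from the bitstream at position i
    """
    decoded = np.full(n_tokens, -1)              # -1 marks masked positions
    for step in range(steps):
        masked = np.flatnonzero(decoded < 0)
        if masked.size == 0:
            break
        pmfs = predict_pmfs(decoded)             # parallel PMF prediction
        ent = -(pmfs * np.log2(pmfs + 1e-12)).sum(axis=1)
        # decode an equal share of the remaining tokens at each step
        k = int(np.ceil(masked.size / (steps - step)))
        pick = masked[np.argsort(ent[masked])[:k]]  # most confident first
        for i in pick:
            decoded[i] = decode_token(i, pmfs[i])
    return decoded

# Toy run: a flat-PMF "model" and a coder that always emits symbol 0.
V = 4
toy_pmfs = lambda d: np.full((10, V), 1.0 / V)
toy_decoder = lambda i, pmf: 0
out = decode_frame(toy_pmfs, toy_decoder, n_tokens=10, steps=4)
assert np.all(out == 0)        # every position gets decoded
```

The key property is that decoding order is chosen per frame from the predicted entropies rather than being fixed to a raster scan, so confident regions are committed first and condition the harder ones.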

