HIERARCHICAL AUTOREGRESSIVE MODELING FOR NEURAL VIDEO COMPRESSION

Abstract

Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.

1. INTRODUCTION

Recent advances in deep generative modeling have seen a surge of applications, including learning-based compression. Generative models have already demonstrated empirical improvements in image compression, outperforming classical codecs (Minnen et al., 2018; Yang et al., 2020d), such as BPG (Bellard, 2014). In contrast, the less developed area of neural video compression remains challenging due to complex temporal dependencies operating at multiple scales. Nevertheless, recent neural video codecs have shown promising performance gains (Agustsson et al., 2020), in some cases on par with current hand-designed, classical codecs, e.g., HEVC. Compared to hand-designed codecs, learnable codecs are not limited to a specific data modality and offer a promising approach for streaming specialized content, such as sports or video chats. Therefore, improving neural video compression is vital for dealing with the ever-growing amount of video content being created.

Source compression fundamentally involves decorrelation, i.e., transforming input data into white noise distributions that can be easily modeled and entropy-coded. Thus, improving a model's capability to decorrelate data automatically improves its compression performance. Likewise, we can improve the associated entropy model (i.e., the model's prior) to capture any remaining dependencies.

Just as compression techniques attempt to remove structure, generative models attempt to model structure. One family of models, autoregressive flows, maps between less structured distributions, e.g., uncorrelated noise, and more structured distributions, e.g., images or video (Dinh et al., 2014; 2016). The inverse mapping can remove dependencies in the data, making it more amenable to compression. Thus, a natural question to ask is how autoregressive flows can best be utilized in compression, and whether mechanisms in existing compression schemes can be interpreted as flows.
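To make the flow view concrete, the following minimal NumPy sketch (not from the paper) treats each video frame as predicted from the previous one; plain temporal differencing stands in for the learned networks that would produce the shift μ and scale σ in an actual codec. Applying the inverse transform turns a correlated frame sequence into near-white residuals that are cheaper to entropy-code, while the forward transform restores the frames exactly:

```python
import numpy as np

# A temporal affine autoregressive transform, x_t = mu(x_{<t}) + sigma(x_{<t}) * z_t.
# Here mu copies the previous frame and sigma is fixed to 1 (shift-only case);
# a learned codec would predict both from past frames.

def inverse(frames):
    """Data -> noise: z_t = (x_t - mu_t) / sigma_t, removing temporal dependence."""
    mu = np.concatenate([frames[:1] * 0, frames[:-1]])  # mu_t = x_{t-1}, mu_0 = 0
    sigma = np.ones_like(frames)
    return (frames - mu) / sigma

def forward(z):
    """Noise -> data: x_t = mu_t + sigma_t * z_t, run sequentially."""
    x = np.zeros_like(z)
    prev = np.zeros_like(z[0])
    for t in range(len(z)):
        x[t] = prev + z[t]  # mu_t = x_{t-1}, sigma_t = 1
        prev = x[t]
    return x

# Strongly correlated toy "video": each frame is the previous plus small noise.
rng = np.random.default_rng(0)
frames = np.cumsum(rng.normal(0, 0.1, size=(16, 8, 8)), axis=0)
z = inverse(frames)
assert np.allclose(forward(z), frames)  # the transform is exactly invertible
assert z[1:].std() < frames[1:].std()   # residuals are far "whiter" than frames
```

The residuals `z` have much lower variance than the frames themselves, which is precisely why decorrelating transforms of this kind help compression.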
This paper draws on recent insights in hierarchical sequential latent variable models with autoregressive flows (Marino et al., 2020). In particular, we identify connections between this family of models and recent neural video codecs based on motion estimation (Lu et al., 2019; Agustsson et al., 2020). By interpreting this technique as an instantiation of a more general autoregressive flow transform, we propose various alternatives and improvements based on insights from generative modeling. In more detail, our main contributions are as follows:

1. A new framework. We interpret existing video compression methods through the more general framework of generative modeling, variational inference, and autoregressive flows, allowing us to readily investigate extensions and ablations. In particular, we compare fully data-driven approaches with motion-estimation-based neural compression schemes, and consider a more expressive prior model for better entropy coding (described in the second contribution below). This framework also provides directions for future work.

2. A new model. Following the predictive coding paradigm of video compression (Wiegand et al., 2003), we extend the Scale-Space Flow model (Agustsson et al., 2020) with a more flexible decoder and prior.

2. RELATED WORK

We divide related work into three categories: neural image compression, neural video compression, and sequential generative models. 



¹ https://github.com/privateyoung/Youtube-NT



Image Compression. Considerable progress has been made by applying neural networks to image compression. Early works by Toderici et al. (2017) and Johnston et al. (2018) leveraged LSTMs to model spatial correlations of the pixels within an image. Theis et al. (2017) first proposed an autoencoder architecture for image compression and used the straight-through estimator (Bengio et al., 2013) for learning a discrete latent representation. The connection to probabilistic generative models was drawn by Ballé et al. (2017), who first applied variational autoencoders (VAEs) (Kingma & Welling, 2013) to image compression. In subsequent work, Ballé et al. (2018) encoded images with a two-level VAE architecture involving a scale hyper-prior, which can be further improved by autoregressive structures (Minnen et al., 2018; Minnen & Singh, 2020) or by optimization at encoding time (Yang et al., 2020d). Yang et al. (2020e) and Flamich et al. (2019) demonstrated competitive image compression performance without a pre-defined quantization grid.

Neural Video Compression. Compared to image compression, video compression is a significantly more challenging problem, as statistical redundancies exist not only within each video frame (exploited by intra-frame compression) but also along the temporal dimension. Early works by Wu et al. (2018), Djelouah et al. (2019), and Han et al. (2019) performed video compression by predicting future frames using a recurrent neural network, whereas Chen et al. (2019) and Chen et al. (2017) used convolutional architectures within a traditional block-based motion estimation approach. These early approaches did not outperform the traditional H.264 codec and barely surpassed the MPEG-2 codec. Lu et al. (2019) adopted a hybrid architecture that combined a pre-trained Flownet (Dosovitskiy et al., 2015) with residual compression, which leads to an elaborate training scheme. Habibian et al. (2019) and Liu et al. (2020) combined 3D convolutions for dimensionality reduction with expressive autoregressive priors for better entropy modeling, at the expense of parallelism and runtime efficiency. Our method extends a low-latency model proposed by Agustsson et al. (2020), which allows for end-to-end training, efficient online encoding and decoding, and parallelism.

Scale-Space Flow (SSF) (Agustsson et al., 2020) uses motion estimation to predict the frame being compressed, and further compresses the residual obtained by subtraction. Our proposed model extends the SSF model with a more flexible decoder and prior, and improves over the state of the art in rate-distortion performance. Specifically, we

• Incorporate a learnable scaling transform to allow for more expressive and accurate reconstruction. Augmenting a shift transform to scale-then-shift is inspired by improvements from extending NICE (Dinh et al., 2014) to RealNVP (Dinh et al., 2016).

• Introduce a structured prior over the two sets of latent variables in the generative model of SSF, corresponding to jointly encoding the motion information and residual information. As the two tend to be spatially correlated, encoding residual information conditioned on motion information results in a more informed prior, and thus a better entropy model, for the residual information; this cuts down the bit-rate of the latter, which typically dominates the overall bit-rate.

3. A new dataset. The neural video compression community currently lacks large, high-resolution benchmark datasets. While we extensively experimented on the publicly available Vimeo-90k dataset (Xue et al., 2019), we also collected and utilized a larger dataset, YouTube-NT¹, available through executable scripts. Since no training data was publicly released for the previous state-of-the-art method (Agustsson et al., 2020), YouTube-NT would be a useful resource for making and comparing further progress in this field.
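The two model changes above can be illustrated with a small hypothetical NumPy sketch; `warp`, the scale map, and the linear conditioning weight `w` below are stand-ins for the learned components of the actual codec:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Scale-then-shift reconstruction (warp() is a placeholder here) ---

def warp(frame, flow):
    # Stand-in for scale-space warping; a real codec warps by `flow`.
    return frame

def reconstruct(prev_recon, flow, residual, scale=None):
    pred = warp(prev_recon, flow)
    if scale is None:                    # SSF: shift-only, x_hat = pred + r
        return pred + residual
    return scale * pred + residual       # proposed: scale-then-shift

# --- Structured prior: code z_res conditioned on the motion latent ---

def rate_factorized(z_res):
    """Bits to code z_res under an unconditional N(0, 1) prior."""
    return 0.5 * np.log2(2 * np.pi) + z_res**2 / (2 * np.log(2))

def rate_structured(z_res, z_motion, w=0.8):
    """Bits under N(w * z_motion, 1); the weight w stands in for a learned
    network predicting the residual latent from the decoded motion latent."""
    return 0.5 * np.log2(2 * np.pi) + (z_res - w * z_motion)**2 / (2 * np.log(2))

# When motion and residual latents are correlated, conditioning pays off:
z_motion = rng.normal(size=10_000)
z_res = 0.8 * z_motion + 0.6 * rng.normal(size=10_000)
assert rate_structured(z_res, z_motion).mean() < rate_factorized(z_res).mean()
```

The toy rate comparison mirrors the argument in the second bullet: because the conditional prior has smaller prediction error than the factorized one, the residual latent costs fewer bits on average.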

