HIERARCHICAL AUTOREGRESSIVE MODELING FOR NEURAL VIDEO COMPRESSION

Abstract

Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.

1. INTRODUCTION

Recent advances in deep generative modeling have seen a surge of applications, including learning-based compression. Generative models have already demonstrated empirical improvements in image compression (Minnen et al., 2018; Yang et al., 2020d), outperforming classical codecs such as BPG (Bellard, 2014). In contrast, the less developed area of neural video compression remains challenging due to complex temporal dependencies operating at multiple scales. Nevertheless, recent neural video codecs have shown promising performance gains (Agustsson et al., 2020), in some cases on par with current hand-designed, classical codecs, e.g., HEVC. Compared to hand-designed codecs, learnable codecs are not limited to a specific data modality and offer a promising approach for streaming specialized content, such as sports or video chats. Therefore, improving neural video compression is vital for dealing with the ever-growing amount of video content being created.

Source compression fundamentally involves decorrelation, i.e., transforming input data into white noise distributions that can be easily modeled and entropy-coded. Thus, improving a model's capability to decorrelate data automatically improves its compression performance. Likewise, we can improve the associated entropy model (i.e., the model's prior) to capture any remaining dependencies. Just as compression techniques attempt to remove structure, generative models attempt to model structure. One family of models, autoregressive flows, maps between less structured distributions, e.g., uncorrelated noise, and more structured distributions, e.g., images or video (Dinh et al., 2014; 2016). The inverse mapping can remove dependencies in the data, making it more amenable to compression. Thus, a natural question to ask is how autoregressive flows can best be utilized in compression, and whether mechanisms in existing compression schemes can be interpreted as flows.
This paper draws on recent insights in hierarchical sequential latent variable models with autoregressive flows (Marino et al., 2020). In particular, we identify connections between this family of models and recent neural video codecs based on motion estimation (Lu et al., 2019; Agustsson et al., 2020). By interpreting this technique as an instantiation of a more general autoregressive flow transform, we propose various alternatives and improvements based on insights from generative modeling. In more detail, our main contributions are as follows:

1. A new framework. We interpret existing video compression methods through the more general framework of generative modeling, variational inference, and autoregressive flows, allowing us to readily investigate extensions and ablations. In particular, we compare fully data-driven approaches with motion-estimation-based neural compression schemes, and consider a more expressive prior model for better entropy coding (described in the second bullet point below). This framework also provides directions for future work.

2. A new model. Following the predictive coding paradigm of video compression (Wiegand et al., 2003), Scale-Space Flow (SSF) (Agustsson et al., 2020) uses motion estimation to predict the frame being compressed, and further compresses the residual obtained by subtraction. Our proposed model extends the SSF model with a more flexible decoder and prior, and improves over the state of the art in rate-distortion performance. Specifically, we

• Incorporate a learnable scaling transform to allow for more expressive and accurate reconstruction. Augmenting the shift transform to scale-then-shift is inspired by the improvements obtained when extending NICE (Dinh et al., 2014) to RealNVP (Dinh et al., 2016).

• Introduce a structured prior over the two sets of latent variables in the generative model of SSF, corresponding to jointly encoding the motion information and residual information.
As the two tend to be spatially correlated, encoding residual information conditioned on motion information results in a more informed prior, and thus a better entropy model, for the residual information; this cuts down the bit-rate for the latter, which typically dominates the overall bit-rate.

3. A new dataset. The neural video compression community currently lacks large, high-resolution benchmark datasets. While we experimented extensively on the publicly available Vimeo-90k dataset (Xue et al., 2019), we also collected and utilized a larger dataset, YouTube-NT, available through executable scripts. Since no training data was publicly released for the previous state-of-the-art method (Agustsson et al., 2020), YouTube-NT will be a useful resource for making and comparing further progress in this field.

2. RELATED WORK

We divide related work into three categories: neural image compression, neural video compression, and sequential generative models.

Neural Image Compression. Considerable progress has been made by applying neural networks to image compression. Early works by Toderici et al. (2017) and Johnston et al. (2018) leveraged LSTMs to model spatial correlations of the pixels within an image. Theis et al. (2017) first proposed an autoencoder architecture for image compression, using the straight-through estimator (Bengio et al., 2013) to learn a discrete latent representation. The connection to probabilistic generative models was drawn by Ballé et al. (2017), who first applied variational autoencoders (VAEs) (Kingma & Welling, 2013) to image compression. In subsequent work, Ballé et al. (2018) encoded images with a two-level VAE architecture involving a scale hyper-prior, which can be further improved by autoregressive structures (Minnen et al., 2018; Minnen & Singh, 2020) or by optimization at encoding time (Yang et al., 2020d).

Sequential Generative Models. Video generation has been studied with sequential latent variable models (Vondrick et al., 2016; Lee et al., 2018) as well as autoregressive models and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018; Kingma et al., 2016; Papamakarios et al., 2017). Recently, Marino et al. (2020) proposed to combine latent variable models with autoregressive flows for modeling dynamics at different levels of abstraction, which inspired our models and viewpoints.

3. VIDEO COMPRESSION THROUGH DEEP AUTOREGRESSIVE MODELING

We identify commonalities between hierarchical autoregressive flow models (Marino et al., 2020) and state-of-the-art neural video compression architectures (Agustsson et al., 2020) , and will use this viewpoint to propose improvements on existing models.

3.1. BACKGROUND

We first review VAE-based compression schemes (Ballé et al., 2017) and formulate existing low-latency video codecs in this framework; we then review the related autoregressive flow model.

Generative Modeling and Source Compression. Given a sequence of video frames x_{1:T}, lossy compression seeks a compact description of x_{1:T} that simultaneously minimizes the description length R and the information loss D. The distortion D measures the reconstruction error caused by encoding x_{1:T} into a lossy representation ẑ_{1:T} and subsequently decoding it back to x̂_{1:T}, while R measures the bit-rate (file size). In learned compression methods (Ballé et al., 2017; Theis et al., 2017), the above process is parameterized by flexible functions f (the "encoder") and g (the "decoder") that map between the video and its latent representation ẑ_{1:T} = f(x_{1:T}). The goal is to minimize a rate-distortion loss, with the tradeoff between the two terms controlled by a hyperparameter β > 0:

L = D(x_{1:T}, g(ẑ_{1:T})) + β R(ẑ_{1:T}).

We adopt the end-to-end compression approach of Ballé et al. (2017), which approximates the rounding operation (required for entropy coding) by uniform noise injection to enable gradient-based optimization during training. With an appropriate choice of probability model p(z_{1:T}), the relaxed version of the above R-D (rate-distortion) objective then corresponds to the VAE objective:

L = E_{q(z_{1:T} | x_{1:T})} [−log p(x_{1:T} | z_{1:T}) − log p(z_{1:T})]. (1)

In this model, the likelihood p(x_{1:T} | z_{1:T}) follows a Gaussian distribution with mean x̂_{1:T} = g(z_{1:T}) and diagonal covariance (β / (2 log 2)) I, and the approximate posterior q is chosen to be a unit-width uniform distribution (which thus has zero differential entropy) whose mean ẑ_{1:T} is predicted by an amortized inference network f. The prior density p(z_{1:T}) interpolates its discretized version, so that it measures the code length of the discretized ẑ_{1:T} after entropy coding.
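As a minimal sketch of the training-time relaxation described above (NumPy; the function names are ours, not from the paper), rounding is replaced by additive uniform noise on an interval of width one, matching the unit-width uniform posterior q:

```python
import numpy as np

def quantize(z):
    """Hard rounding applied at test time; the entropy coder sees integers."""
    return np.round(z)

def relax(z, rng):
    """Differentiable training-time surrogate: additive uniform noise on [-0.5, 0.5)."""
    return z + rng.uniform(-0.5, 0.5, size=z.shape)

rng = np.random.default_rng(0)
z = np.array([0.2, 1.7, -2.4])
z_hard = quantize(z)       # integer latents used at test time
z_soft = relax(z, rng)     # noisy latents used during training; within 0.5 of z
```

The noisy latents have the same marginal statistics as the rounding error, which is what makes the relaxed rate term a valid proxy for the discrete code length.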

Low-Latency Sequential Compression

We specialize Eq. 1 to make it suitable for low-latency video compression, as widely used in both conventional and recent neural codecs (Rippel et al., 2019; Agustsson et al., 2020). To this end, we encode and decode individual frames x_t in sequence. Given the ground-truth current frame x_t and the previously reconstructed frames x̂_{<t}, the encoder is restricted to the form ẑ_t = f(x_t, x̂_{<t}), and the decoder likewise computes the reconstruction sequentially, based on previous reconstructions and the current encoding: x̂_t = g(x̂_{<t}, ẑ_t). Existing codecs usually condition on a single reconstructed frame, substituting x̂_{<t} by x̂_{t−1} in favor of efficiency. In the language of variational inference, the sequential encoder corresponds to a variational posterior of the form q(z_t | x_t, z_{<t}), i.e., filtering, and the sequential decoder corresponds to the likelihood p(x_t | z_{≤t}) = N(x_t | x̂_t, (β / (2 log 2)) I); in both distributions, the probabilistic conditioning on z_{<t} is based on the observation that x̂_{t−1} is a deterministic function of z_{<t}, if we identify ẑ_t with the random variable z_t and unroll the recurrence x̂_t = g(x̂_{<t}, z_t). As we show, all sequential compression approaches considered in this work follow this paradigm, and the form of the reconstruction transform x̂ determines the lowest level of the corresponding generative process for the video x.
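The sequential loop above can be sketched with toy stand-ins for f and g (here, simple quantized difference coding; all names are illustrative, not the paper's networks). The key structural point is that the decoder only ever sees codes and past reconstructions, never the ground-truth frames:

```python
import numpy as np

def encoder(x_t, x_prev):
    """Toy stand-in for f: quantized difference w.r.t. the previous reconstruction."""
    return np.round(x_t - x_prev)

def decoder(x_prev, z_t):
    """Toy stand-in for g: shift the previous reconstruction by the decoded code."""
    return x_prev + z_t

frames = [np.array([0.0, 1.0]), np.array([0.4, 1.6]), np.array([1.1, 2.0])]
x_prev = np.zeros(2)   # encoder and decoder start from the same initial state
recons = []
for x_t in frames:
    z_t = encoder(x_t, x_prev)      # encoder sees the ground-truth current frame
    x_prev = decoder(x_prev, z_t)   # decoder conditions only on codes + reconstructions
    recons.append(x_prev)
```

Because both sides update `x_prev` from the same codes, the encoder and decoder states never drift apart, and the per-frame reconstruction error stays bounded by the quantizer resolution.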

Masked Autoregressive Flow (MAF).

As a final component of neural sequence modeling, we discuss MAF (Papamakarios et al., 2017), which models the joint distribution of a sequence p(x_{1:T}) in terms of a simpler distribution over its underlying noise variables y_{1:T}, through the following autoregressive transform and its inverse:

x_t = h_µ(x_{<t}) + h_σ(x_{<t}) ⊙ y_t ⇔ y_t = (x_t − h_µ(x_{<t})) / h_σ(x_{<t}). (2)

The noise variable y_t usually comes from a standard normal distribution. While the forward MAF transforms a sequence of standard normal noise variables into a data sequence, the inverse flow "whitens" the data sequence and removes temporal correlations. Due to its invertible nature, MAF allows for exact likelihood computations; however, as we explain in Section 3.3, we will not exploit this aspect for compression, but rather draw on its expressiveness in modeling conditional likelihoods.
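A toy NumPy illustration of Eq. 2 (with hypothetical shift and scale functions of our choosing, not learned networks) makes the exact invertibility concrete: the forward pass builds data from noise, and the inverse recovers the noise bit-for-bit.

```python
import numpy as np

def mu(prefix):     # toy shift: predict the previous value (0 at t = 0)
    return prefix[-1] if len(prefix) else 0.0

def sigma(prefix):  # toy scale: a constant, for simplicity
    return 0.5

def maf_forward(y):
    """x_t = mu(x_<t) + sigma(x_<t) * y_t  (noise -> data)."""
    x = []
    for y_t in y:
        x.append(mu(x) + sigma(x) * y_t)
    return np.array(x)

def maf_inverse(x):
    """y_t = (x_t - mu(x_<t)) / sigma(x_<t)  (data -> whitened noise)."""
    y = []
    for t in range(len(x)):
        prefix = list(x[:t])
        y.append((x[t] - mu(prefix)) / sigma(prefix))
    return np.array(y)

y = np.array([1.0, -0.3, 0.8])
x = maf_forward(y)
y_rec = maf_inverse(x)
```

Note the asymmetry exploited in compression: the inverse (whitening) direction is fully parallel given the data, while the forward direction is inherently sequential.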

3.2. A GENERAL FRAMEWORK FOR GENERATIVE VIDEO CODING

We now describe a general framework that captures several existing low-latency neural compression methods as specific instances and gives rise to the exploration of new models. To this end, we combine latent variable models with autoregressive flows into a joint framework. We consider a sequential decoding procedure of the following form:

x̂_t = h_µ(x̂_{t−1}, w_t) + h_σ(x̂_{t−1}, w_t) ⊙ g_v(v_t, w_t). (3)

Eq. 3 resembles the definition of the MAF in Eq. 2, but augments the transform with two sets of latent variables w_t, v_t ∼ p(w_t, v_t). Above, h_µ and h_σ are functions that transform the previously reconstructed frame x̂_{t−1}, along with w_t, into a shift and a scale parameter, respectively. The function g_v(v_t, w_t) converts these latent variables into a noise variable that encodes residuals with respect to the mean next-frame prediction h_µ(x̂_{t−1}, w_t).

This stochastic decoder model has several advantages over existing generative models for compression, such as simpler flows or sequential VAEs. First, the stochastic autoregressive transform h_µ(x̂_{t−1}, w_t) involves a latent variable w_t and is therefore more expressive than a deterministic transform (Schmidt & Hofmann, 2018; Schmidt et al., 2019). Second, compared to MAF, the additional nonlinear transform g_v(v_t, w_t) enables more expressive residual noise, reducing the burden on the entropy model. Finally, as visualized in Figure 2, the scale parameter h_σ(x̂_{t−1}, w_t) effectively acts as a gating mechanism, determining how much variance is explained by the autoregressive transform versus the residual noise distribution. This provides an added degree of flexibility, similar to how RealNVP improves over NICE (Dinh et al., 2014; 2016).

Our approach is inspired by Marino et al. (2020), who analyzed a restricted version of the model in Eq. 3, aiming to hybridize autoregressive flows and sequential latent variable models for video prediction. In contrast to Eq. 3, their model involved deterministic transforms as well as residual noise that came from a sequential VAE.
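To make the role of the two latent sets concrete, here is a toy sketch of one decoding step of Eq. 3. All functions are illustrative stand-ins for the learned networks; in particular, the scale is chosen to shrink where the prediction is trusted, mimicking the gating behavior described above:

```python
import numpy as np

def h_mu(x_prev, w):
    """Toy stochastic next-frame prediction (shift): previous frame moved by w."""
    return x_prev + w

def h_sigma(x_prev, w):
    """Toy gating scale: small where |w| is large, i.e., where motion is active."""
    return 1.0 / (1.0 + np.abs(w))

def g_v(v, w):
    """Toy residual decoder: pass the residual latents through unchanged."""
    return v

x_prev = np.array([0.0, 1.0, 2.0])   # previous reconstruction
w = np.array([0.1, -0.2, 0.0])       # "motion" latents
v = np.array([0.3, 0.0, -0.1])       # residual latents
x_t = h_mu(x_prev, w) + h_sigma(x_prev, w) * g_v(v, w)   # one step of Eq. 3
```

Setting w to zero everywhere recovers a plain scale-shift transform of the residual, which is the deterministic (TAT-like) special case discussed in Section 3.3.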

3.3. EXAMPLE MODELS AND EXTENSIONS

Next, we show that the general framework expressed by Eq. 3 captures a variety of state-of-the-art neural video compression schemes and gives rise to extensions and new models.

Temporal Autoregressive Transform (TAT). The first special case among the models captured by Eq. 3 is the autoregressive neural video compression model of Yang et al. (2020b), which we refer to as the temporal autoregressive transform (TAT). As shown in Figure 1(a), the decoder g implements a deterministic scale-shift autoregressive transform of the decoded noise y_t:

x̂_t = g(ẑ_t, x̂_{t−1}) = h_µ(x̂_{t−1}) + h_σ(x̂_{t−1}) ⊙ y_t, y_t = g_z(ẑ_t). (4)

The encoder f inverts the transform to decorrelate the input frame x_t into ȳ_t and encodes the result as ẑ_t = f(x_t, x̂_{t−1}) = f_z(ȳ_t), where ȳ_t = (x_t − h_µ(x̂_{t−1})) / h_σ(x̂_{t−1}). The shift h_µ and scale h_σ transforms are parameterized by neural networks; f_z is a convolutional neural network (CNN), and g_z is a deconvolutional neural network (DNN) that approximately inverts f_z.

The TAT decoder is a simple special case of the more general stochastic autoregressive transform in Eq. 3, in which h_µ and h_σ lack latent variables. Indeed, interpreting the probabilistic generative process for x, TAT implements the model proposed by Marino et al. (2020), as the transform from y to x is a MAF. However, the generative process corresponding to compression (reviewed in Section 3.1) adds additional white noise to x̂, with x := x̂ + ε, ε ∼ N(0, (β / (2 log 2)) I). Thus, the generative process from y to x is no longer an autoregressive flow. Regardless, TAT was shown to capture the low-level dynamics of video frames better than the autoencoder (f_z, g_z) alone, and the inverse transform decorrelates raw video frames to simplify the input to the encoder f_z (Yang et al., 2020b).

DVC (Lu et al., 2019) and Scale-Space Flow (SSF; Agustsson et al., 2020). The second class of models captured by Eq.
3 belongs to the conventional video compression framework based on predictive coding (Cutler, 1952; Wiegand et al., 2003; Sullivan et al., 2012); both models make use of two sets of latent variables z_{1:T} = {w_{1:T}, v_{1:T}} to capture different aspects of the information being compressed, where w captures estimated motion information used in the warping prediction, and v helps capture the residual error not predicted by warping.

Like most classical approaches to video compression by predictive coding, the reconstruction transform in the above models has the form of a prediction shifted by residual error (decoded noise), and lacks the scaling factor h_σ of the autoregressive transform in Eq. 3:

x̂_t = h_warp(x̂_{t−1}, g_w(w_t)) + g_v(v_t, w_t). (5)

Above, g_w and g_v are DNNs, o_t := g_w(w_t) has the interpretation of an estimated optical flow (motion) field, h_warp is the computer-vision technique of warping, and the residual r_t := g_v(v_t, w_t) = x̂_t − h_warp(x̂_{t−1}, o_t) represents the prediction error unaccounted for by warping. DVC (Lu et al., 2019) only makes use of v_t in the residual decoder g_v, and performs simple 2D warping by bilinear interpolation; SSF (Agustsson et al., 2020) augments the optical flow (motion) field o_t with an additional scale field, and applies scale-space warping to progressively blurred versions of x̂_{t−1} to allow for uncertainty in the warping prediction. The encoding procedure in the above models computes the variational mean parameters as ŵ_t = f_w(x_t, x̂_{t−1}), v̂_t = f_v(x_t − h_warp(x̂_{t−1}, g_w(ŵ_t))), corresponding to a structured posterior q(z_t | x_t, z_{<t}) = q(w_t | x_t, z_{<t}) q(v_t | w_t, x_t, z_{<t}). We illustrate the above generative and inference procedures in Figure 1(b).

Proposed: models based on the Stochastic Temporal Autoregressive Transform. Finally, we consider the most general models described by the stochastic autoregressive transform in Eq. 3, shown in Figure 1(c).
We study two main variants, categorized by how they implement h_µ and h_σ:

STAT uses DNNs for h_µ and h_σ as in Yang et al. (2020b), but complements them with a latent variable w_t that characterizes the transform. In principle, more flexible transforms should give better compression performance, but we find the following variant more parameter-efficient in practice.

STAT-SSF: a less data-driven variant of the above that still uses scale-space warping (Agustsson et al., 2020) in the shift transform, i.e., h_µ(x̂_{t−1}, w_t) = h_warp(x̂_{t−1}, g_w(w_t)). This can also be seen as an extended version of the SSF model, whose shift transform h_µ is preceded by a new learned scale transform h_σ.

Structured Prior (SP). Besides improving the autoregressive transform (affecting the likelihood model for x_t), one variant of our approach also improves the topmost level of the generative hierarchy in the form of a more expressive latent prior p(z_{1:T}), affecting the entropy model for compression. We observe that the motion information encoded in w_t can often be informative about the residual error encoded in v_t. In other words, large residual errors v_t incurred by the mean prediction h_µ(x̂_{t−1}, w_t) (e.g., the result of warping the previous frame x̂_{t−1}) are often spatially collocated with (unpredictable) motion as encoded by w_t. The original SSF model's prior factorizes as p(w_t, v_t) = p(w_t) p(v_t) and does not capture such correlation. We therefore propose a structured prior that introduces a conditional dependence between w_t and v_t, so that p(w_t, v_t) = p(w_t) p(v_t | w_t). At a high level, this can be implemented by introducing a new neural network that maps w_t to the parameters of a parametric distribution p(v_t | w_t) (e.g., the mean and variance of a diagonal Gaussian). This results in variants of the above models, STAT-SP and STAT-SSF-SP, in which the structured prior is applied on top of the proposed STAT and STAT-SSF models.
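A toy numerical example illustrates why the structured prior can tighten the entropy model: when the residual latents correlate with the motion latents, a predictor from w to the mean of p(v | w) (here a single linear coefficient standing in for the neural network; all data are synthetic) shortens the expected code length under a Gaussian entropy model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
w = rng.normal(size=n)                   # "motion" latents
v = 0.8 * w + 0.2 * rng.normal(size=n)   # "residual" latents, correlated with w

def gaussian_bits(x, mean, scale):
    """Average code length (bits/symbol) of x under a Gaussian entropy model."""
    nll = 0.5 * ((x - mean) / scale) ** 2 + np.log(scale) + 0.5 * np.log(2 * np.pi)
    return np.mean(nll) / np.log(2)      # nats -> bits

# Factorized prior p(v): fit a single marginal mean/scale.
bits_factorized = gaussian_bits(v, v.mean(), v.std())

# Structured prior p(v | w): a (here linear) predictor maps w to the mean of v.
a = np.dot(w, v) / np.dot(w, w)          # least-squares coefficient
resid = v - a * w
bits_structured = gaussian_bits(v, a * w, resid.std())
```

The conditional model explains away the component of v predictable from w, leaving only the much narrower residual distribution to be coded; this is exactly the bit-rate saving attributed to the SP variants above.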

4. EXPERIMENTS

In this section, we train our models on both the existing Vimeo-90k dataset and our new YouTube-NT dataset. Our models improve over state-of-the-art neural and classical video compression methods when evaluated on several publicly available benchmark datasets. The lower-level modules and training scheme for our models largely follow Agustsson et al. (2020); we provide detailed model diagrams and a schematic implementation, including the proposed scaling transform and structured prior, in Appendix A.4. We also implement a more computationally efficient version of scale-space warping (Agustsson et al., 2020) based on a Gaussian pyramid and interpolation (instead of the naive Gaussian blurring of Agustsson et al. (2020)); pseudocode is available in Appendix A.3.

4.1. TRAINING DATASETS

Vimeo-90k (Xue et al., 2019) consists of 90,000 clips of 7 frames each at 448x256 resolution, collected from vimeo.com; it has been used in previous works (Lu et al., 2019; Yang et al., 2020a; Liu et al., 2020). While other publicly available video datasets exist, they typically have lower resolution and/or specialized content; e.g., Kinetics (Carreira & Zisserman, 2017) only contains human-action videos, and previous methods trained on Kinetics (Wu et al., 2018; Habibian et al., 2019; Golinski et al., 2020) generally report worse rate-distortion performance on diverse benchmarks (such as UVG, discussed below) than Agustsson et al. (2020), who trained on a significantly larger and higher-resolution dataset collected from youtube.com.

YouTube-NT. This is our new dataset. We collected 8,000 nature videos and movie/video-game trailers from youtube.com and processed them into 300k high-resolution (720p) clips, which we refer to as YouTube-NT. We release YouTube-NT in the form of customizable scripts to facilitate future compression research. Table 1 compares the current version of YouTube-NT with Vimeo-90k (Xue et al., 2019) and with Google's closed-access training dataset (Agustsson et al., 2020). Figure 5b shows the evaluation performance of the SSF model architecture after training on each dataset.

Evaluation. We evaluate compression performance on the widely used UVG (Mercat et al., 2020) and MCL-JCV (Wang et al., 2016) datasets, both consisting of raw videos in YUV420 format. UVG is widely used for testing the HEVC codec and contains seven 1080p videos at 120fps with smooth, mild motions or stable camera movements. MCL-JCV contains thirty 1080p videos at 30fps, which are generally more diverse, with a higher degree of motion and less stable camera work. We compute the bit-rate (bits per pixel, BPP) and the reconstruction quality (measured in PSNR), averaged across all frames.
We note that PSNR is a more challenging metric than MS-SSIM (Wang et al., 2003) for learned codecs (Lu et al., 2019; Agustsson et al., 2020; Habibian et al., 2019; Yang et al., 2020a;c). Since existing neural compression methods assume video input in RGB format (24 bits/pixel), we follow this convention in our evaluations for meaningful comparisons. We note that HEVC also has special support for YUV420 (12 bits/pixel), allowing it to exploit this more compact file format and effectively halve the input bit-rate on our test videos (which were coded in YUV420 by default), giving it an advantage over all neural methods. Regardless, we report the performance of HEVC in YUV420 mode (in addition to the default RGB mode) for reference. Our proposed STAT-SSF-SP model outperforms HEVC even in its favored YUV420 mode, as well as the state-of-the-art neural method SSF (Agustsson et al., 2020) and the established DVC (Lu et al., 2019) model, which leverages a more complicated model and multi-stage training procedure. We also note that, as expected, our proposed STAT model improves over TAT (Yang et al., 2020b), the latter lacking stochasticity in the autoregressive transform compared to our proposed STAT and its variants. Figure 3b shows that the performance ranking on MCL-JCV is similar to that on UVG, despite MCL-JCV having more diverse and challenging (e.g., animated) content (Agustsson et al., 2020). We provide qualitative results in Figures 2 and 4, offering insight into the behavior of the proposed scaling transform and structured prior, as well as the visual quality of the top-performing methods.

4.4. ABLATION ANALYSIS

Using the baseline SSF model (Agustsson et al., 2020), we study the performance contribution of each of our proposed components, the stochastic temporal autoregressive transform (STAT) and the structured prior (SP), in isolation. We trained on YouTube-NT and evaluated on UVG. As shown in Figure 5a, STAT improves performance to a greater degree than SP; SP alone does not provide a noticeable improvement over vanilla SSF (note, however, that when combined with STAT, SP offers a further improvement over STAT alone, as shown by STAT-SSF-SP vs. STAT-SSF in Figure 3a). To quantify the effect of training data on performance, we compare the test performance (on UVG) of the SSF model trained on Vimeo-90k (Xue et al., 2019) and on YouTube-NT. We also include the results reported by Agustsson et al. (2020), who trained on a larger (and unreleased) dataset. As seen from the R-D curves in Figure 5b, training on YouTube-NT improves rate-distortion performance over Vimeo-90k, in many cases bridging the gap with the performance obtained from the larger closed-access training dataset of Agustsson et al. (2020). At higher bit-rates, the model trained on Vimeo-90k tends to perform similarly to the one trained on YouTube-NT; this is likely because YouTube-NT currently covers only 8,000 videos, limiting the diversity of the short clips.

5. DISCUSSION

We provide a unifying perspective on sequential video compression and temporal autoregressive flows (Marino et al., 2020), and elucidate the relationship between the two in terms of their underlying generative hierarchy. From this perspective, we view several video compression methods, particularly the state-of-the-art Scale-Space Flow method (Agustsson et al., 2020), as sequential variational autoencoders implementing a more general stochastic temporal autoregressive transform, which allows us to naturally extend the Scale-Space Flow model and obtain improved rate-distortion performance on standard public benchmark datasets. Further, we provide scripts to generate a new high-resolution video dataset, YouTube-NT, which is substantially larger than current publicly available datasets. Together, we hope that this new perspective and dataset will drive further progress in the nascent yet highly impactful field of learned video compression.



https://github.com/privateyoung/Youtube-NT



Figure 1: Model diagrams for the generative and inference procedures of the current frame x_t, for various neural video compression methods. Random variables are shown in circles; all other quantities are deterministically computed. Solid and dashed arrows describe computational dependencies during generation (decoding) and inference (encoding), respectively. Purple nodes correspond to neural encoders (CNNs) and decoders (DNNs), and green nodes implement the temporal autoregressive transform. (a) TAT; (b) SSF; (c) STAT or STAT-SSF; the magenta box highlights the additional proposed scale transform absent in SSF, and the red arrow from w_t to v_t highlights the proposed (optional) structured prior. See Appendix Fig. 7 for computational diagrams of the structured prior.

Figure 2: Visualizing the proposed STAT-SSF-SP model on one frame of the UVG video "ShakeNDry". The two methods in comparison, STAT-SSF (proposed) and SSF (Agustsson et al., 2020), have reconstruction quality comparable to STAT-SSF-SP but higher bit-rates; the (BPP, PSNR) results for STAT-SSF-SP, STAT-SSF, and SSF are (0.046, 36.97), (0.053, 36.94), and (0.075, 36.97), respectively. In this example, the warping prediction µ̂_t = h_µ(x̂_{t−1}, ŵ_t) incurs a large error around the dog's moving contour but models the mostly static background well, with the residual latents v̂_t taking up an order of magnitude more bit-rate than ŵ_t in all three methods. The proposed scale parameter σ̂_t gives the model extra flexibility when combining the noise ŷ_t (decoded from (v̂_t, ŵ_t)) with the warping prediction µ̂_t (decoded from ŵ_t only) to form the reconstruction x̂_t = µ̂_t + σ̂_t ⊙ ŷ_t: the scale σ̂_t downweights the contribution from the noise ŷ_t in the foreground, where it is very costly, and thereby reduces the residual bit-rate R(v̂_t) (and thus the overall bit-rate) compared to STAT-SSF and SSF (at similar reconstruction quality), as illustrated in the third and fourth figures of the top row.

Figure 3: Rate-distortion performance of various models and ablations, evaluated on the (a) UVG and (b) MCL-JCV datasets. All learning-based models (except VCII (Wu et al., 2018)) are trained on Vimeo-90k. STAT-SSF-SP (proposed) achieves the best performance.

Figure 4: Qualitative comparisons of various methods on a frame from MCL-JCV video 30. Figures in the bottom row focus on the same image patch on top. Here, models with the proposed scale transform (STAT-SSF and STAT-SSF-SP) outperform the ones without, yielding visually more detailed reconstructions at lower rates; structured prior (STAT-SSF-SP) reduces the bit-rate further.

Figure 5: Ablations & Comparisons. (a) An ablation study on our proposed components. (b) Performance of SSF (Agustsson et al., 2020) trained on different datasets. Both sets of results are evaluated on UVG.

Figure 7: Computational flowchart for the proposed STAT-SSF-SP model. The left two subfigures show the decoder and encoder flowcharts for w_t and v_t, respectively, with "AT" denoting the autoregressive transform. The right two subfigures show the prior distributions used for entropy-coding w_t and v_t, respectively, with p(w_t, w_t^h) = p(w_t^h) p(w_t | w_t^h) and p(v_t, v_t^h | w_t, w_t^h) = p(v_t^h) p(v_t | v_t^h, w_t, w_t^h), where w_t^h and v_t^h denote hyper-latents (see Agustsson et al. (2020) for a description of hyper-priors). Note that the priors in the SSF and STAT-SSF models (without the proposed structured prior) correspond to the special case where the HyperDecoder for v_t does not receive w_t^h and w_t as inputs, so that the entropy model for v_t is independent of w_t: p(v_t, v_t^h) = p(v_t^h) p(v_t | v_t^h).

Yang et al. (2020e) and Flamich et al. (2019) demonstrated competitive image compression performance without a pre-defined quantization grid.

Overview of Training Datasets. All models are trained on three consecutive frames at a batch size of 8; the frames are randomly selected from each clip and then randomly cropped to 256x256. We trained with an MSE loss, following a procedure similar to Agustsson et al. (2020) (see Appendix A.2 for details).
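A hedged sketch of this sampling step (array shapes only; the function name and shape conventions are ours, not the paper's actual data pipeline):

```python
import numpy as np

def sample_training_clip(video, rng, n_frames=3, crop=256):
    """Pick n consecutive frames at a random offset, then a random spatial crop.
    `video` is a (T, H, W, C) array; a toy stand-in for the real loader."""
    t = rng.integers(0, video.shape[0] - n_frames + 1)
    y = rng.integers(0, video.shape[1] - crop + 1)
    x = rng.integers(0, video.shape[2] - crop + 1)
    return video[t:t + n_frames, y:y + crop, x:x + crop]

rng = np.random.default_rng(0)
clip = np.zeros((7, 256, 448, 3))   # one Vimeo-90k style clip (7 frames, 448x256)
patch = sample_training_clip(clip, rng)
```

Each training example is thus a (3, 256, 256, C) tensor, matching the three-frame, 256x256 setting described above.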

Overview of compression methods and the datasets they were trained on (if applicable). We trained our models on Vimeo-90k to compare with the published results of the baseline models listed in Table 2. Figure 3a compares our proposed models (STAT-SSF, STAT-SSF-SP) with the previous state-of-the-art classical codec HEVC and with neural codecs on the UVG test dataset. Our STAT-SSF-SP model provides superior performance at bit-rates ≥ 0.07 BPP, outperforming conventional HEVC.

6. ACKNOWLEDGEMENTS

We gratefully acknowledge extensive contributions from Yang Yang (Qualcomm), which were indispensable to this work. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0021. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Yibo Yang acknowledges funding from the Hasso Plattner Foundation. Furthermore, this work was supported by the National Science Foundation under Grants 1928718, 2003237 and 2007719, as well as Intel and Qualcomm. 

A APPENDIX

A.1 COMMAND FOR HEVC CODEC

To avoid the FFmpeg package taking advantage of the input file's color format (YUV420), we first dump the video.yuv file to a sequence of lossless png files:

Then we use the default low-latency setting in ffmpeg to compress the dumped png sequences:

where crf is the parameter for quality control. The compressed video is encoded by HEVC in RGB color space. To get the result of HEVC (YUV420), we directly execute:

A.2 TRAINING DETAILS

Training time is about four days on an NVIDIA Titan RTX. Similar to Agustsson et al. (2020), we use the Adam optimizer (Kingma & Ba, 2015), training the models for 1,050,000 steps. The initial learning rate of 1e-4 is decayed to 1e-5 after 900,000 steps, and we increase the crop size to 384x384 for the last 50,000 steps. All models are optimized using the MSE loss.

A.3 EFFICIENT SCALE-SPACE-FLOW IMPLEMENTATION

Agustsson et al. (2020) use a simple implementation of scale-space flow by convolving the previous reconstructed frame x̂_{t−1} with a sequence of Gaussian kernels with σ² = {0, σ_0², (2σ_0)², (4σ_0)², (8σ_0)², (16σ_0)²}. However, this leads to a large kernel size when σ is large, which can be computationally expensive; for example, a Gaussian kernel with σ² = 256 usually requires a 97x97 kernel to avoid artifacts (the kernel size is usually (6σ + 1)²). To alleviate this problem, we use an efficient version of the Gaussian scale-space based on a Gaussian pyramid with upsampling: with a Gaussian pyramid, we can always use a Gaussian kernel with σ = σ_0 to consecutively blur and downsample the image. At the final step, we only need to upsample all the downsampled images to the original size to approximate a scale-space 3D tensor.
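Under these assumptions, the pyramid-based construction might be sketched as follows (NumPy; a toy stand-in that uses nearest-neighbor upsampling in place of smooth interpolation, with helper names of our choosing):

```python
import numpy as np

def gauss_kernel1d(sigma):
    """Sampled 1D Gaussian kernel with radius 3*sigma, normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    """Separable 2D Gaussian blur with reflect padding (output size == input size)."""
    k = gauss_kernel1d(sigma)
    r = len(k) // 2
    p = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 1, p)
    return np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 0, rows)

def upsample2x(img):
    """Nearest-neighbor 2x upsampling (simplification of smooth interpolation)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def scale_space_volume(img, sigma0=1.0, depth=3):
    """Approximate the scale-space stack via blur -> downsample -> upsample,
    instead of blurring at full resolution with ever-larger kernels."""
    ssv = [img]                          # level 0: the unblurred frame (sigma = 0)
    cur = img
    for i in range(depth):
        cur = blur(cur, sigma0)          # fixed small kernel at every level
        tmp = cur
        for _ in range(i):               # bring level i back to full resolution
            tmp = upsample2x(tmp)
        ssv.append(tmp)
        cur = cur[::2, ::2]              # 2x downsample for the next level
    return np.stack(ssv)

img = np.random.default_rng(0).normal(size=(32, 32))
ssv = scale_space_volume(img)            # (depth + 1, H, W) scale-space tensor
```

Because every level uses the same small σ_0 kernel on a half-resolution image, the cost per level shrinks geometrically, rather than growing with the effective blur radius.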
The detailed algorithm is described in Algorithm 1.

Algorithm 1: An efficient algorithm to build a scale-space 3D tensor
Result: ssv: scale-space 3D tensor
Input: input: input image; σ_0: base scale; M: scale depth
ssv = [input];
kernel = Create_Gaussian_Kernel(σ_0);
for i = 0 to M-1 do
    input = GaussianBlur(input, kernel);
    if i == 0 then
        ssv.append(input);
    else
        tmp = input;
        for j = 0 to i-1 do
            tmp = UpSample2x(tmp);  // step upsampling for smooth interpolation
        end
        ssv.append(tmp);
    end
    input = DownSample2x(input);
end
return Concat(ssv)

A.4 LOWER-LEVEL ARCHITECTURE DIAGRAMS

Figure 6 illustrates the low-level encoder, decoder, and hyper-en/decoder modules used in our proposed STAT-SSF and STAT-SSF-SP models, as well as in the baseline TAT and SSF models, based on Agustsson et al. (2020). Figure 7 shows the encoder-decoder flowchart for w_t and v_t separately, as well as their corresponding entropy models (priors), in the STAT-SSF-SP model.

Figure 6: Backbone module architectures, where "5x5/2, 128" means a 5x5 convolution kernel with stride 2 and 128 filters.

