OVERFITTING FOR FUN AND PROFIT: INSTANCE-ADAPTIVE DATA COMPRESSION

Abstract

Neural data compression has been shown to outperform classical methods in terms of rate-distortion (RD) performance, with results still improving rapidly. At a high level, neural compression is based on an autoencoder that tries to reconstruct the input instance from a (quantized) latent representation, coupled with a prior that is used to losslessly compress these latents. Due to limited model capacity and imperfect optimization and generalization, such models will in general compress test data suboptimally. However, one of the great strengths of learned compression is that if the test-time data distribution is known and relatively low-entropy (e.g. a camera watching a static scene, a dash cam in an autonomous car, etc.), the model can easily be finetuned or adapted to this distribution, leading to improved RD performance. In this paper we take this concept to the extreme, adapting the full model to a single video, and sending model updates (quantized and compressed using a parameter-space prior) along with the latent representation. Unlike previous work, we finetune not only the encoder/latents but the entire model, and, during finetuning, take into account both the effect of model quantization and the additional cost incurred by sending the model updates. We evaluate an image compression model on I-frames (sampled at 2 fps) from videos of the Xiph dataset, and demonstrate that full-model adaptation improves RD performance by ∼1 dB relative to encoder-only finetuning.

1. INTRODUCTION

The most common approach to neural lossy compression is to train a variational autoencoder (VAE)-like model on a training dataset to minimize the expected RD cost D + βR (Theis et al., 2017; Kingma & Welling, 2013). Although this approach has proven to be very successful (Ballé et al., 2018), a model trained to minimize expected RD cost over a full dataset is unlikely to be optimal for every test instance, because the model has limited capacity and both optimization and generalization will be imperfect. The problem of generalization is especially significant when the test distribution differs from the training distribution, as is likely to be the case in practice. Suboptimality of the encoder has been studied extensively under the term inference suboptimality (Cremer et al., 2018), and it has been shown that finetuning the encoder or latents for a particular instance can lead to improved compression performance (Lu et al., 2020; Campos et al., 2019; Yang et al., 2020b; Guo et al., 2020). This approach is appealing because no additional information needs to be added to the bitstream and nothing changes on the receiver side. Performance gains are limited, however, because the prior and decoder cannot be adapted. In this paper we present a method for full-model instance-adaptive compression, i.e. adapting the entire model to a single data instance. Unlike previous work, our method accounts for the cost of sending updates not only for the latent prior but also for the decoder, as well as for the quantization of these updates. This is achieved by extending the typical RD loss with an additional model-rate term M that measures the number of bits required to send the model updates under a newly introduced model prior, resulting in a combined RDM loss.
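To make the combined objective concrete, the sketch below shows one plausible form of an RDM loss in which quantized parameter updates δ are coded under a discretized zero-centered Gaussian model prior. This is an illustrative toy (all names, the choice of prior, and the bin width are assumptions of this sketch, not the paper's implementation):

```python
import math
import numpy as np

def log_model_prior(delta, sigma=0.05, bin_width=0.01):
    """Log-probability (nats) of quantized parameter updates under a
    discretized zero-centered Gaussian: each update's probability is the
    Gaussian mass falling in its quantization bin."""
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))
    return np.array([math.log(cdf(d + bin_width / 2) - cdf(d - bin_width / 2))
                     for d in delta])

def rdm_loss(distortion, latent_rate, delta, beta):
    """Toy instance-adaptive objective L_RDM = D + beta * (R + M), where M is
    the number of nats needed to code the model updates delta."""
    model_rate = -log_model_prior(delta).sum()  # M: cost of sending updates
    return distortion + beta * (latent_rate + model_rate)
```

Because the model prior is centered at zero, leaving a parameter unchanged (δ = 0) is the cheapest option, so under this objective finetuning only pays for the weights it actually moves.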
As an initial proof of concept, we show that this approach can lead to very substantial gains in RD performance (∼1 dB PSNR gain at the same bitrate) on the problem of I-frame video coding, where a set of key frames, sampled from a video at 2 fps, is independently coded using an I-frame (image compression) model. Additionally, we show how the model-rate bits are distributed across the model, and (by means of an ablation study) quantify the individual gains achieved by including a model-rate loss and by using quantization-aware finetuning. The rest of this paper is structured as follows. Section 2 discusses the basics of neural compression and related work on adaptive compression. Section 3 presents our method, including details on the RDM loss, the choice of the model prior, its quantization, and the (de)coding procedure. In Sections 4 and 5 we present our experiments and results, followed by a discussion in Section 6.

2.1. NEURAL DATA COMPRESSION

The standard approach to neural compression can be understood as a particular kind of VAE (Kingma & Welling, 2013). In the compression literature the encoder q_φ(z|x) is typically defined by a neural network parameterized by φ, with either deterministic output (so that q_φ(z|x) is one-hot) (Habibian et al., 2019) or with fixed uniform [0, 1] noise on the outputs (Ballé et al., 2018). In both cases, sampling z ∼ q_φ(z|x) is used during training, while quantization is used at test time. The latent z is encoded to the bitstream using entropy coding in conjunction with a latent prior p_θ(z), so that coding z takes about −log p_θ(z) bits (up to discretization). On the receiving side, the entropy decoder is used with the same prior p_θ(z) to decode z, after which x is reconstructed using the decoder network p_θ(x|z) (note that we use the same symbol θ to denote the parameters of the prior and decoder jointly, as in our method both will have to be coded and added to the bitstream). From these considerations it is clear that the rate R and distortion D can be measured by the two terms in the following loss:

L_RD(φ, θ) = β E_{q_φ(z|x)}[−log p_θ(z)] + E_{q_φ(z|x)}[−log p_θ(x|z)],

where the first term (weighted by β) is the rate R and the second is the distortion D. This loss is equal (up to the trade-off parameter β and an additive constant) to the standard negative evidence lower bound (ELBO) used in VAE training. The rate term of the ELBO is written as a KL divergence between encoder and prior, but since D_KL(q ∥ p) = R − H[q], and the encoder entropy H[q] is constant in our case, minimizing the KL loss is equivalent to minimizing the rate loss.

Neural video compression is typically decomposed into the problem of independently compressing a set of key frames (i.e. I-frames) and conditionally compressing the remaining frames (Lu et al., 2019; Liu et al., 2020; Wu et al., 2018; Djelouah et al., 2019; Yang et al., 2020a). In this work, we specifically focus on improving I-frame compression.
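As a toy illustration of this loss (not the authors' implementation: the standard-normal latent prior, the MSE distortion, and all names below are stand-ins), the rate and distortion terms can be computed as:

```python
import numpy as np

def log_latent_prior(z):
    """Standard-normal stand-in for the latent prior p_theta(z)."""
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

def rd_loss(x, x_hat, z, beta):
    """Toy L_RD = beta * R + D: the rate R is -log p_theta(z) (in nats,
    up to discretization) and the distortion D is the mean squared error
    between the input x and its reconstruction x_hat."""
    rate = -log_latent_prior(z).sum()
    distortion = np.mean((x - x_hat) ** 2)
    return beta * rate + distortion
```

With a perfect reconstruction and β = 0 the loss vanishes; increasing β shifts the optimum toward latents that are cheaper to code at the expense of reconstruction quality.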

2.2. ADAPTIVE COMPRESSION

A compression model is trained on a dataset D with the aim of achieving optimal RD performance on test data. However, because of limited model capacity, optimization difficulties, or insufficient data (resulting in poor generalization), the model will in general not achieve this goal. When the test distribution differs from the training distribution, good generalization is not guaranteed even in the limit of infinite data, unlimited model capacity, and perfect optimization. A convenient feature of neural compression, however, is that a model can easily be finetuned on new data or data from a specific domain. A model can, for instance, be trained further after deployment,

