

Abstract

Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework, which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypotheses Dropout (MH-Dropout). On ImageNet 64×64, MVQ reduces FID in existing vector quantization architectures by up to 68% at 2 tokens per instance and 57% at 5 tokens. These improvements widen as the number of codebook entries is reduced, and allow for a 7-45× speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead MVQ to identify transferable visual representations, multiple of which can be smoothly combined.

1. INTRODUCTION

In deep generative modelling, the choice of latent representation is an important consideration due to trade-offs in sample stability, quality, model size and compatibility with different modalities. Generative models with continuous latent random variables that assume parametric distributions have led the field in sample quality for the past decade (Vahdat & Kautz, 2020; Donahue & Simonyan, 2019). Despite their impressive performance, these models can be challenging to stabilise, resulting in problems such as posterior collapse (Lucas et al., 2019a; He et al., 2019; Lucas et al., 2019b). Interest in discrete representations to address these challenges has seen a revival recently with the development of several discrete autoencoders (Van Den Oord et al., 2017; Razavi et al., 2019; Ramesh et al., 2021; Esser et al., 2021; Nichol et al., 2021; Rombach et al., 2022) with improved stability in high-dimensional visual and audio domains. This approach maps each instance to a discrete sequence of codebook indices (tokens) using vector quantization (VQ) (Van Den Oord et al., 2017), Gumbel-softmax (Jang et al., 2016) or Concrete distributions (Maddison et al., 2016). In a secondary training stage, an autoregressive probabilistic model, such as a PixelCNN (Van Oord et al., 2016) or Transformer (Vaswani et al., 2017), learns the categorical posterior, representing the distribution of observable token sequences. Despite these improvements, growing the representational capacity of discrete autoencoders remains tied to increasing the number of tokens assigned to each instance and the number of total codebook entries. Increasing both hyper-parameters can scale performance to high-quality images, as seen in Razavi et al. (2019) where the authors used 1280 tokens per image. However, this resulted in unrealistically long ancestral sampling times and significantly higher computational resource use to fit the categorical posterior.
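The tokenization step described above can be sketched as a nearest-neighbour lookup against the codebook. The function and variable names below are illustrative, not from any released implementation:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry.

    z        : (N, D) array of encoder outputs
    codebook : (K, D) array of K learnable code vectors
    Returns (tokens, quantized), where tokens are the discrete indices.
    """
    # Squared Euclidean distance from every latent to every code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    tokens = dists.argmin(axis=1)       # (N,) one discrete token per latent
    return tokens, codebook[tokens]     # quantized latents fed to the decoder

# Tiny usage example with a hypothetical 4-entry codebook.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((4, 8))                      # K=4 entries, D=8
z = codebook[[2, 0]] + 0.01 * rng.standard_normal((2, 8))   # latents near entries 2 and 0
tokens, z_q = vector_quantize(z, codebook)                  # tokens -> [2, 0]
```

The resulting token sequences are what the second-stage autoregressive model (PixelCNN or Transformer) is trained on, which is why both the sequence length and the codebook size K directly drive sampling time and training cost.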
More recent works, such as VQGAN (Esser et al., 2021) and Stable Diffusion (Rombach et al., 2022), introduced a discriminator (Goodfellow et al., 2014) and diffusion models (Sohl-Dickstein et al., 2015) to reduce the token length to 256 per image; however, total sampling times remain rather long at 7258 seconds per image on a single Nvidia Titan X. Whilst demonstrating impressive quality, these times may continue to hinder the use of discrete autoencoders in challenging domains such as video generation and real-time applications, particularly where compute resources are constrained. To address this problem, we explore whether each instance can be compressed into a shorter sequence of tokens using a smaller codebook. We achieve this goal by introducing Masked Vector Quantization (MVQ), a novel variant of VQ that allows the mask configurations of each codebook vector to be individually mapped to separate instances. More precisely, in our framework each instance is encoded with three components: a primary code e, a secondary code e′ and a mask m_j, composed as ẑ_j = e + (e′ ⊙ m_j), as shown in Figure 1. In comparison to standard VQ, each primary code can represent a further 2^D′ instances, where D′ is the embedding dimension of the secondary code. During training, the best primary and secondary code vector for each instance can be found using standard nearest-neighbour lookup from the codebook. It is, however, computationally infeasible to search for the best mask across all 2^D′ possibilities in latent spaces of the dimensions typically used in practice. To overcome this, we introduce Multiple Hypotheses Dropout (MH-Dropout), a novel variant of dropout (Hinton et al., 2012) that incorporates multiple hypotheses training (Guzman-Rivera et al., 2012; Rupprecht et al., 2017). During the forward pass, J < 2^D′ masks are randomly sampled, yielding J latent representations.
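A minimal sketch of this forward pass is below. The nearest-neighbour lookups for e and e′ follow the text; searching the secondary codebook against the residual z − e is an assumption of this sketch, as are all names:

```python
import numpy as np

def mvq_encode(z, primary_cb, secondary_cb, J, rng):
    """Sketch of the MVQ forward pass for a single latent z of dimension D'.

    primary_cb, secondary_cb : (K, D') codebooks
    J                        : number of masks sampled per instance (J < 2**D')
    Returns the primary code e, secondary code e2 (e'), the sampled binary
    masks, and the J candidate latents e + (e' * m_j).
    """
    # Nearest-neighbour lookup for the primary code e.
    e = primary_cb[((z - primary_cb) ** 2).sum(-1).argmin()]
    # Nearest-neighbour lookup for the secondary code e' (here matched against
    # the residual z - e; the exact search target is an assumption).
    e2 = secondary_cb[(((z - e) - secondary_cb) ** 2).sum(-1).argmin()]
    # MH-Dropout: sample J of the 2**D' possible binary mask configurations.
    masks = rng.integers(0, 2, size=(J, e2.shape[0]))
    z_hats = e[None, :] + masks * e2[None, :]   # (J, D') candidate latents
    return e, e2, masks, z_hats

# Usage with hypothetical 4-entry codebooks and D' = 6.
rng = np.random.default_rng(1)
primary_cb = rng.standard_normal((4, 6))
secondary_cb = rng.standard_normal((4, 6))
z = rng.standard_normal(6)
e, e2, masks, z_hats = mvq_encode(z, primary_cb, secondary_cb, J=5, rng=rng)
```

Each of the J candidates is decoded and scored, which sets up the winner-takes-all backward pass described next.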
During the backward pass, we use a winner-takes-all reconstruction loss where only the best of the J representations affects the gradient. The MVQ framework improves existing VQ architectures, such as VQ-VAE2 (Razavi et al., 2019) and VQGAN (Esser et al., 2021), particularly when the number of tokens per instance and the codebook size are reduced. Across multiple medium-resolution datasets, we observe FID reductions of up to 82% at 2 tokens per instance, 57% at 5 tokens and 14% at 17 tokens. These improvements consistently grow as the number of codebook entries decreases, and also reduce token sampling times by 58-87% on a consumer-grade GPU and 80-97% on CPU. We also find that the dimensions of masked codes learn characteristics that can be smoothly interpolated, combined and transferred to other primary codes. These features are sometimes clearly interpretable (such as the presence of eyeglasses or a smile) and can correspond to individual dimensions, as seen in Figure 2. Our paper is outlined as follows: In Section 2, we review related work on discrete autoencoders, dropout and multiple hypotheses training. In Section 3, we present background on the popular vector quantization framework which our work builds upon. In Section 4, we introduce our Masked Vector Quantization framework.
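The winner-takes-all selection can be sketched as follows; in an autodiff framework, returning only the winning hypothesis's loss is what restricts the gradient to that hypothesis (the function name is illustrative):

```python
import numpy as np

def wta_reconstruction_loss(x, x_hats):
    """Winner-takes-all loss over J reconstruction hypotheses.

    x      : target instance, any shape
    x_hats : (J, *x.shape) stack of J candidate reconstructions
    Returns (loss of the best hypothesis, its index); only this term
    would be backpropagated, so the other J-1 hypotheses get no gradient.
    """
    # Per-hypothesis mean squared reconstruction error.
    losses = ((x_hats - x[None]) ** 2).reshape(len(x_hats), -1).mean(axis=1)
    j_star = int(losses.argmin())   # the "winner" among the J hypotheses
    return losses[j_star], j_star

# Usage: hypothesis 1 is closest to the target, so it wins.
x = np.zeros(4)
x_hats = np.stack([np.ones(4), np.full(4, 0.1), np.full(4, 2.0)])
loss, j = wta_reconstruction_loss(x, x_hats)   # j -> 1
```

Because only the winner receives gradient, different mask configurations are free to specialise on different instances rather than all regressing to an average reconstruction.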



Figure 1: MVQ (bottom) compared to VQ (top). A unique pair of primary and secondary code vectors can encode up to 2^D′ reconstructions. Multiple Hypotheses Dropout trains sampled mask configurations of the secondary code vector to represent different factors of variation (age, hair, skin tone, background, etc.). For example, comparing D(e) to x_2, the masked secondary code vector represents older age and a made-up skin tone.

