

Abstract

Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework, which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypotheses Dropout (MH-Dropout). On ImageNet 64×64, MVQ reduces FID in existing vector quantization architectures by up to 68% at 2 tokens per instance and 57% at 5 tokens. These improvements widen as the number of codebook entries is reduced, and allow for a 7-45× speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead MVQ to identify transferable visual representations that can be smoothly combined.

1. INTRODUCTION

In deep generative modelling, the choice of latent representation is an important consideration due to trade-offs in sample stability, quality, model size, and compatibility with different modalities. Generative models with continuous latent random variables that assume parametric distributions have led the field in sample quality for the past decade (Vahdat & Kautz, 2020; Donahue & Simonyan, 2019). Despite their impressive performance, these models can be challenging to stabilise, resulting in problems such as posterior collapse (Lucas et al., 2019a; He et al., 2019; Lucas et al., 2019b). Interest in discrete representations to address these challenges has seen a revival recently with the development of several discrete autoencoders (Van Den Oord et al., 2017; Razavi et al., 2019; Ramesh et al., 2021; Esser et al., 2021; Nichol et al., 2021; Rombach et al., 2022) with improved stability in high-dimensional visual and audio domains. This approach maps each instance to a discrete sequence of codebook indices (tokens) using vector quantization (VQ) (Van Den Oord et al., 2017), Gumbel-softmax (Jang et al., 2016), or Concrete distributions (Maddison et al., 2016). In a secondary training stage, an autoregressive probabilistic model, such as a PixelCNN (Van Oord et al., 2016) or Transformer (Vaswani et al., 2017), learns the categorical posterior, representing the distribution of observable token sequences. Despite these improvements, growing the representational capacity of discrete autoencoders remains tied to increasing the number of tokens assigned to each instance and the number of total codebook entries. Increasing both hyper-parameters can scale performance to high-quality images, as seen in Razavi et al. (2019), where the authors used 1280 tokens per image. However, this resulted in unrealistically long ancestral sampling times and significantly higher computational resource use to fit the categorical posterior.
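The quantization step described above can be sketched in a few lines: each continuous encoder output is replaced by its nearest codebook entry, and the entry's index becomes the discrete token. This is a minimal illustration of the nearest-neighbour lookup in Van Den Oord et al. (2017), omitting the straight-through gradient estimator and codebook losses used in training; the function and variable names are our own.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous encoder output to its nearest codebook entry.

    z: (n, d) array of encoder outputs; codebook: (K, d) array of code vectors.
    Returns the token indices and the quantized vectors.
    """
    # Squared Euclidean distance from every z to every codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)  # one discrete token per position
    return tokens, codebook[tokens]

# Toy example: 4 encoder outputs quantized against an 8-entry codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))
z = rng.normal(size=(4, 2))
tokens, z_q = vector_quantize(z, codebook)
```

The sequence of `tokens` is what the second-stage autoregressive model is trained to predict, so both the token count per instance and the codebook size `K` directly determine that model's cost.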
More recent works, such as VQGAN (Esser et al., 2021) and Stable Diffusion (Rombach et al., 2022), introduced a discriminator (Goodfellow et al., 2014) and diffusion models (Sohl-Dickstein et al., 2015) to reduce token length to 256 per image; however, total sampling times remain rather long at 7258 seconds per image on a single Nvidia Titan X. Whilst demonstrating impressive quality, these times may continue to hinder the use of discrete autoencoders in challenging domains such as video generation and real-time applications, particularly where compute resources are constrained. To address this problem, we explore whether each instance can be compressed into a shorter sequence of tokens using a smaller codebook. We achieve this goal by introducing Masked Vector Quantization (MVQ).
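The stochastic winner-takes-all idea behind MH-Dropout, as described in the abstract, can be illustrated with a deliberately simplified sketch: sample several candidate binary masks over a code vector, decode each masked variant, and keep only the mask whose reconstruction error is lowest (in training, gradients would flow through that winner alone). This is our own assumed rendering of the mechanism, not the paper's exact algorithm; all names (`mh_dropout_wta`, `decode`, the mask rate `p`) are hypothetical.

```python
import numpy as np

def mh_dropout_wta(code, target, decode, n_hypotheses=4, p=0.5, rng=None):
    """Winner-takes-all masking sketch (assumed, not the paper's exact method).

    Samples n_hypotheses random binary masks over one code vector and returns
    the mask (and its error) that best reconstructs the target after decoding.
    """
    rng = rng or np.random.default_rng()
    best_err, best_mask = np.inf, None
    for _ in range(n_hypotheses):
        # Each entry of the code vector is kept with probability 1 - p.
        mask = (rng.random(code.shape) > p).astype(code.dtype)
        err = ((decode(code * mask) - target) ** 2).mean()
        if err < best_err:  # winner takes all: keep only the best hypothesis
            best_err, best_mask = err, mask
    return best_mask, best_err

# Toy usage with an identity "decoder" standing in for a real network.
code = np.array([0.5, 2.0])
target = np.ones(2)
mask, err = mh_dropout_wta(code, target, decode=lambda v: v,
                           n_hypotheses=8, rng=np.random.default_rng(1))
```

Under this reading, each codebook entry effectively represents many masked variants of itself, which is how a small codebook and short token sequence can retain representational capacity.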

