LEARNING DISCRETE REPRESENTATION WITH OPTIMAL TRANSPORT QUANTIZED AUTOENCODERS

Abstract

Vector quantized variational autoencoder (VQ-VAE) has recently emerged as a powerful generative model for learning discrete representations. As with other vector quantization methods, one key challenge in training VQ-VAE is codebook collapse, i.e., only a fraction of the codes are used, limiting reconstruction quality. To mitigate this issue, VQ-VAE training often relies on carefully designed heuristics that encourage the use of more codes. In this paper, we propose a simple yet effective alternative based on optimal transport, which regularizes the quantization by explicitly assigning an equal number of samples to each code. The proposed approach, named OT-VAE, enforces full utilization of the codebook without requiring heuristics such as stop-gradient, exponential moving average, or codebook reset. We empirically validate our approach on three data modalities: images, speech, and 3D human motion. For all modalities, OT-VAE achieves better reconstruction with higher perplexity than other VQ-VAE variants on several datasets. In particular, OT-VAE achieves state-of-the-art results on the AIST++ dataset for 3D dance generation. Our code will be released upon publication.

1. INTRODUCTION

Unsupervised generative modeling aims at generating samples that follow the same distribution as the observed data. Recent deep generative models have shown impressive performance in generating various data modalities such as images, text, and audio, owing to the huge number of parameters in their models. Well-known examples include VQ-GAN (Esser et al., 2021) for high-resolution image synthesis, DALL-E (Ramesh et al., 2021) for realistic image generation from a description in natural language, and Jukebox (Dhariwal et al., 2020) for music generation. Notably, all these models are based, at least partly, on vector quantized variational autoencoders (VQ-VAE) (Van Den Oord et al., 2017). The success of VQ-VAE is largely attributable to its ability to learn discrete, rather than continuous, latent representations, and to its decoupling of learning the discrete representation from learning the prior. Since the quality of the discrete representation is essential to the quality of the generation, our work improves discrete representation learning for arbitrary data modalities.

VQ-VAE is a variant of VAEs (Kingma & Welling, 2014) that first encodes the input data to a discrete variable in a latent space, and then decodes the latent variable back to a sample of the input space. The discrete representation of the latent variable is enabled by vector quantization, generally through a nearest-neighbor lookup in a learnable codebook. A new sample is then generated by decoding a discrete latent variable sampled from an approximate prior, which is learned on the space of the encoded discrete latent variables, in a decoupled fashion, using any autoregressive model (Van Den Oord et al., 2017). Despite promising results on many tasks involving complex data modalities, the naive training scheme of VQ-VAE used in Van Den Oord et al. (2017) often suffers from codebook collapse (Takida et al., 2022), i.e., only a fraction of the codes are effectively used, which largely limits the quality of the discrete latent representations. To this end, many techniques and variants have been proposed, such as stop-gradient along with the commitment and embedding losses (Van Den Oord et al., 2017), exponential moving average (EMA) updates of the codebook (Van Den Oord et al., 2017), codebook reset (Williams et al., 2020), and a stochastic variant (SQ-VAE) (Takida et al., 2022).

Interestingly, the idea of vector quantization has also been explored in the related field of self-supervised learning, although there it typically serves unsupervised discriminative modeling, aiming only at data features that generalize easily to downstream tasks (e.g., Caron et al., 2020).
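To make the nearest-neighbor lookup and the collapse diagnostic concrete, here is a minimal pure-Python sketch; the function names and the perplexity helper are ours, not from the paper (perplexity of the code-usage distribution is the standard measure of how many codes are effectively used):

```python
import math

def quantize(z, codebook):
    """Nearest-neighbor lookup: map each encoder output to its closest code.

    z:        list of d-dimensional vectors (encoder outputs)
    codebook: list of K d-dimensional code vectors
    Returns (code indices, quantized vectors).
    """
    idx, zq = [], []
    for v in z:
        # Squared Euclidean distance from v to every code.
        d2 = [sum((vi - ci) ** 2 for vi, ci in zip(v, c)) for c in codebook]
        k = d2.index(min(d2))
        idx.append(k)
        zq.append(codebook[k])
    return idx, zq

def perplexity(idx, num_codes):
    """Perplexity of code usage: num_codes means uniform use, 1 means full collapse."""
    counts = [idx.count(k) for k in range(num_codes)]
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return math.exp(-sum(p * math.log(p) for p in probs))
```

A fully collapsed codebook (every sample mapped to one code) yields perplexity 1, while perfectly uniform usage of K codes yields perplexity K, which is why the experiments report it alongside reconstruction error.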
The seminal work of Caron et al. (2018) used an encoder together with a clustering algorithm to learn discriminative representations of the data. The clustering algorithm used in Caron et al. (2018), namely K-means, can be interpreted as an offline version of the vector quantization used in VQ-VAE. Similar to codebook collapse, some clusters were observed to contain only a single element, a phenomenon known as cluster collapse. To address this problem, Asano et al. (2020) proposed an optimal transport (OT) based clustering method that explicitly enforces the equipartition of the clusters. Caron et al. (2020) later proposed an online version of this algorithm for dealing with large-scale datasets.

Figure 1: Optimal transport (OT) explicitly enforces the equipartition of the clusters.

In this work, we reformulate VQ-VAE under the framework of Wasserstein autoencoders (WAE) (Tolstikhin et al., 2018), providing a natural connection between distribution matching in the latent space of VQ-VAE and the clustering used in self-supervised learning. Based on this reformulation, we propose an online clustering method to address the codebook collapse issue of training VQ-VAE, by adding the equipartition constraint from Asano et al. (2020) and Caron et al. (2020). The online clustering method, inspired by the OT techniques used in Caron et al. (2020), assigns the same number of samples to each cluster (represented by a code in the codebook); see Figure 1. We then use the Gumbel-softmax trick (Jang et al., 2017) to sample from the resulting discrete categorical distribution while easily estimating its gradient. The resulting approach, named OT-VAE, enforces full utilization of the codebook without using any heuristics, namely stop-gradient, EMA, or codebook reset. Unlike SQ-VAE, which uses a stochastic quantization and dequantization process, our approach explicitly enforces the equipartition constraint for quantization in a deterministic way, while only the decoding is stochastic. To the best of our knowledge, this is the first demonstration that such an equipartition condition, which arose in the field of self-supervised learning, is also useful for generative tasks. We empirically validate our approach on three data modalities: images, speech, and 3D human motion. For all modalities, OT-VAE achieves better reconstruction with higher perplexity than other VQ-VAE variants on several datasets. In particular, OT-VAE achieves state-of-the-art results on the AIST++ (Li et al., 2021) dataset for 3D dance generation.

Overall, our contributions can be summarized as follows:

• We reformulate VQ-VAE as an instance of WAEs, which provides a connection between distribution matching in the latent space of VQ-VAE and the clustering methods used in self-supervised learning.

• We propose OT-VAE, a novel unsupervised generative model that explicitly uses the equipartition constraint, via OT, to address the codebook collapse issue in VQ-VAE.

• In our experiments, without using the classic heuristics (such as stop-gradient, EMA, and codebook reset), we show that OT-VAE achieves better reconstruction and perplexity than other variants of VQ-VAE on three data modalities: image, speech, and 3D human motion.
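The equipartition constraint can be sketched with Sinkhorn-Knopp iterations, in the spirit of Asano et al. (2020) and Caron et al. (2020). This is an illustrative sketch, not the paper's exact algorithm; the entropic regularization `eps` and iteration count `n_iter` are assumed hyper-parameters:

```python
import math

def sinkhorn(scores, eps=0.05, n_iter=200):
    """Sinkhorn-Knopp iterations on an (n x K) score matrix.

    Returns a soft assignment Q in which every row (sample) carries total
    mass 1/n and every column (code) receives total mass ~1/K -- the
    equipartition constraint that prevents codebook collapse.
    """
    n, K = len(scores), len(scores[0])
    # Entropic (Gibbs) kernel: exponentiate similarity scores.
    Q = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iter):
        # Normalize columns: each code gets total mass 1/K.
        col = [sum(Q[i][k] for i in range(n)) for k in range(K)]
        Q = [[Q[i][k] / (K * col[k]) for k in range(K)] for i in range(n)]
        # Normalize rows: each sample distributes total mass 1/n.
        for i in range(n):
            r = sum(Q[i])
            Q[i] = [q / (n * r) for q in Q[i]]
    return Q
```

Even if every sample's raw scores prefer the same code, the alternating column/row normalizations redistribute the assignment so that the codes share the batch equally, with the least-attached samples moved first.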

• Using OT-VAE instead of VQ-VAE in the Bailando model (Li et al., 2022), we obtain state-of-the-art results for 3D dance generation. Precisely, we improve FID_k from 28.75 to 26.74 and FID_g from 11.82 to 9.81 on the AIST++ dataset.

2. RELATED WORK

VQ-VAE (Van Den Oord et al., 2017) is a variant of VAE (Kingma & Welling, 2014) with a discrete prior. VQ-VAE shows good performance on various generation tasks, including image synthesis (Williams et al., 2020; Esser et al., 2021), text-to-image generation (Ramesh et al., 2021), motion generation (Li et al., 2022), and music generation (Dieleman et al., 2018; Dhariwal et al., 2020). However, naive training of VQ-VAE suffers from codebook collapse. To alleviate the problem, a number of techniques are commonly used during training, including stop-gradient along with some losses (Van Den Oord et al., 2017),
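For completeness, here is a minimal sketch of the Gumbel-softmax trick mentioned in the introduction, which lets gradients flow through a discrete sampling step without resorting to the stop-gradient heuristic discussed above. This is a pure-Python illustration with our own naming, not the paper's implementation:

```python
import math
import random

def gumbel_softmax(probs, tau=1.0, rng=random):
    """Draw a soft one-hot sample from a categorical distribution.

    Perturb the log-probabilities with Gumbel(0, 1) noise, then apply a
    temperature-tau softmax. The argmax of the result is distributed
    according to Categorical(probs) (the Gumbel-max property), while the
    soft output remains differentiable with respect to the logits.
    """
    logits = [math.log(p + 1e-12) for p in probs]
    # Gumbel(0, 1) noise via inverse transform of a uniform draw.
    noise = [-math.log(-math.log(rng.random() + 1e-12)) for _ in probs]
    y = [(l + g) / tau for l, g in zip(logits, noise)]
    # Numerically stable softmax.
    m = max(y)
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]
```

Lowering `tau` pushes the output toward a hard one-hot vector, while higher values give smoother, lower-variance gradients; the choice of temperature is a tuning knob, not something fixed by the trick itself.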

