A VAE FOR TRANSFORMERS WITH NONPARAMETRIC VARIATIONAL INFORMATION BOTTLENECK

Abstract

We propose a Variational AutoEncoder (VAE) for Transformers by developing a Variational Information Bottleneck (VIB) regulariser for Transformer embeddings. We formalise such attention-based representations as mixture distributions, and use Bayesian nonparametrics to develop a Nonparametric VIB (NVIB) for them. The variable number of mixture components supported by nonparametrics captures the variable number of vectors supported by attention, and exchangeable distributions from nonparametrics capture the permutation invariance of attention. Our Transformer VAE (NVAE) uses NVIB to regularise the information passing from the Transformer encoder to the Transformer decoder. Evaluations of a NVAE, trained on natural language text, demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity.

1. INTRODUCTION

Attention-based deep learning models, such as Transformers (Vaswani et al., 2017; Devlin et al., 2019), have achieved unprecedented empirical success in a wide range of cognitive tasks, in particular in natural language processing. The use of attention allows these models to represent their input with multiple vectors, which is essential for embedding natural language text (Bahdanau et al., 2015). On the other hand, deep variational Bayesian approaches to representation learning, such as variational autoencoders (VAEs) (Kingma & Welling, 2014), have also been shown to have many benefits (Mathieu et al., 2019; Ghosh et al., 2020; Vahdat & Kautz, 2020), especially due to their variational information bottleneck (VIB) (Alemi et al., 2017) for regularising the induced latent representations. However, it has not been clear how to combine these two trends, because the latent space induced by Transformers is a set of vectors whose size grows with the size of the input, whereas standard VIB methods only apply to a vector space of a fixed size (Liu & Liu, 2019; Fang et al., 2021; Park & Lee, 2021). To define a VIB regulariser for a Transformer's embedding space, we need to allow the size of a latent representation to vary dynamically depending on the complexity of the individual input, and yet regularise the total amount of information conveyed by the whole representation. In this paper, we propose such a variational information bottleneck for variable-sized latent representations, which we use to regularise the embeddings of a Transformer encoder-decoder, giving us a variational autoencoder for Transformers. Like a Transformer encoder's embedding space, the proposed VAE's sampled encoder output is (a generalisation of) a set of vectors, and the decoder accesses this embedding with (a generalisation of) cross attention.
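For concreteness, the fixed-size VIB that the paper contrasts with regularises a single Gaussian latent vector via a KL term against a standard normal prior. The following is a minimal sketch of that standard regulariser (the function name and shapes are illustrative, not from the paper); it makes visible why such a term has no notion of a variable number of vectors:

```python
import numpy as np

def gaussian_vib_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    This is the standard VIB/VAE regulariser for a latent vector of
    *fixed* dimensionality; nothing in it accounts for a set of
    vectors whose size varies with the input.
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# A zero-mean, unit-variance posterior matches the prior exactly:
assert gaussian_vib_kl(np.zeros(8), np.zeros(8)) == 0.0
```

Any deviation of the posterior from the prior makes this term positive, which is the compression pressure the paper generalises to variable-sized representations.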
But unlike Transformers, the proposed VIB layer for this VAE regularises the (effective) number of vectors in the set, as well as the information conveyed by each vector. We show that this regularisation improves generative abilities and compresses latent representations. In addition to the regularisation of over-parameterised language models (Child et al., 2019), previous work shows the efficacy of VAEs for disentanglement (Higgins et al., 2017), language generation (Liu & Liu, 2019), and explainability (Mercatali & Freitas, 2021). All these topics are important and active areas of research in NLP. To define this VIB, we need to model distributions over these variable-sized encoder embeddings, as interpreted by cross attention. Firstly, because the attention function returns an interpolation between the vectors output by the encoder, it generalises across the varying number of vectors, which, like the input length, is theoretically unbounded. Thus, to define distributions over these unbounded embeddings, we need to use nonparametric methods (Jordan, 2010). Secondly, the attention function is insensitive to the order of the vectors output by the encoder, so it interprets this embedding as a permutation-invariant set of vectors. Thus, the distributions over these permutation-invariant embeddings should be exchangeable (Jordan, 2010). Thirdly, the attention function imposes a normalised weighting over the embedding vectors, via the attention weights, so we should model an embedding as a distribution rather than a set. A normalised weighting over an unbounded permutation-invariant set of fixed-length vectors matches exactly the properties of a nonparametric space of mixture distributions, which have been extensively studied in Bayesian nonparametrics using exchangeable distributions (Blei & Jordan, 2006; Jordan, 2010).
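The three properties above can be made concrete with a toy implementation of single-query attention (illustrative names and plain dot-product scoring, not the paper's code): the output is an expectation of the encoder vectors under a softmax weighting, so it is well defined for any number of vectors and invariant to their order:

```python
import numpy as np

def attention(query, vectors):
    """Attention as an expectation under a normalised mixture weighting.

    `vectors` is an (n, d) set of encoder outputs; n may vary per input,
    and the result is invariant to permuting the rows.
    """
    scores = vectors @ query / np.sqrt(vectors.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # normalised weighting over the set
    return weights @ vectors          # interpolation between the vectors

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))           # 5 encoder vectors of dimension 4
q = rng.normal(size=4)
out = attention(q, Z)
# Permuting the set of vectors leaves the output unchanged:
assert np.allclose(out, attention(q, Z[::-1]))
```

The normalised `weights` are exactly the mixture weights of the mixture-distribution view: attention treats the encoder output as a weighted, unordered, variable-sized collection, not an ordered tuple.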
In previous work, Bayesian nonparametrics is typically applied to learning models where the number of parameters grows with the size of the training data (Teh, 2010; Jordan, 2010; Kossen et al., 2021). In contrast, we apply it to inferring latent representations where the number of parameters grows with the size of the input. We believe this is the first work to use nonparametric methods in this way for deep variational Bayesian models. To define a precise equivalence between attention-based representations and mixture distributions, we provide an interpretation of attention where the input set of vectors defines a mixture of impulse distributions, which is used as a prior to denoise the observed query vector (depicted in Figure 1b). Generalising sets of vectors to mixture distributions, and generalising the attention function to query denoising, allows us to propose a general deep variational Bayesian framework for attention-based models using Bayesian nonparametrics. More specifically, we propose to use Dirichlet processes (DPs) as the exchangeable distributions (Aldous, 1985; Jordan, 2010) to specify distributions over mixtures of impulse distributions, including distributions over the effective number of components in the mixture. We define a nonparametric VIB (NVIB) layer using a bounded DP prior and posterior to regularise the effective size of variable-sized latent representations. This NVIB layer uses exact inference to infer the posterior from a set of pseudo-observations, and uses proposed efficient approximations to sample from this posterior with a reparameterisation trick and to regularise it via its KL divergence from the prior. Applying this NVIB regulariser to a Transformer autoencoder gives us our proposed nonparametric variational autoencoder (NVAE), depicted in Figure 1a.
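As background for the DP-based construction, mixture weights can be sampled from a (truncated) Dirichlet process via the standard stick-breaking representation. The sketch below is generic textbook machinery under an assumed fixed truncation level, not the paper's implementation; it illustrates how the concentration parameter controls the effective number of components, which is the kind of control NVIB exerts over the effective number of vectors:

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Sample mixture weights from a truncated Dirichlet process prior.

    beta_k ~ Beta(1, alpha);  pi_k = beta_k * prod_{j<k} (1 - beta_j).
    Small alpha concentrates mass on few components; large alpha
    spreads it over many.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pis = betas * remaining
    return pis / pis.sum()  # renormalise the truncated weights

rng = np.random.default_rng(0)
pi = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
assert np.isclose(pi.sum(), 1.0) and (pi >= 0).all()
```

Pairing each sampled weight with a component mean yields a random mixture of impulse distributions of effectively variable size, which is the object the NVIB layer places priors and posteriors over.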
The noise introduced by sampling from the DP posterior controls the amount of information which flows from the encoder to the decoder, despite the fact that the amount of information required to reconstruct different text inputs varies enormously. To evaluate the effectiveness of NVIB, we train a NVAE on natural language text and find that it is able to reconstruct, generate and regularise the effective number of vectors in the latent representation, thereby demonstrating that NVAE is a viable VAE. We also find that the regularised latent space is smooth, using a proposed method for interpolating between DP posteriors to generate interpolations between sentences.

Contributions. This paper makes the following contributions: (1) We propose a variational Bayesian framework for modelling attention-based representations using mixture distributions, denoising attention and Bayesian nonparametrics (Section 2). (2) We propose a nonparametric variational information bottleneck (NVIB) regulariser for learning attention-based representations (Section 3). (3) We propose a nonparametric variational autoencoder (NVAE), which is a variational Bayesian extension of a Transformer encoder-decoder (Section 4). (4) We show that the NVAE model is a competitive VAE which can reconstruct, generate, regularise its latent space and intuitively interpolate between sentences (Section 5).
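The sentence-interpolation experiments rest on the fact that each input's posterior is described by a finite set of pseudo-observation parameters. One plausible interpolation scheme, sketched here purely for illustration (it is an assumption, not necessarily the paper's exact method), is to keep both posteriors' components and cross-fade their pseudo-counts:

```python
import numpy as np

def interpolate_posteriors(means_a, counts_a, means_b, counts_b, t):
    """Illustrative interpolation between two posteriors over mixtures.

    Each posterior is summarised by component means with pseudo-counts.
    Keeping both component sets and scaling their pseudo-counts by
    (1 - t) and t moves smoothly from posterior A (t=0) to B (t=1).
    """
    means = np.concatenate([means_a, means_b], axis=0)
    counts = np.concatenate([(1.0 - t) * counts_a, t * counts_b])
    return means, counts / counts.sum()

means_a, counts_a = np.zeros((2, 3)), np.array([1.0, 2.0])
means_b, counts_b = np.ones((2, 3)), np.array([3.0, 1.0])
m, w = interpolate_posteriors(means_a, counts_a, means_b, counts_b, t=0.0)
# At t=0 all normalised weight sits on posterior A's components:
assert np.isclose(w[:2].sum(), 1.0) and np.allclose(w[2:], 0.0)
```

Decoding from the interpolated parameters at intermediate t would then produce the intermediate sentences, which is how a smooth latent space manifests in generation.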



The code is available at https://github.com/idiap/nvib and https://github.com/idiap/nvib_transformers.




Figure 1: (a) Illustration of the NVAE model, with its NVIB layer. (b) Query denoising attention at training time, with the sampled distribution as the query prior, a noisy query observation, and the expected value of the denoised query. (c) Query denoising attention at test time using the mean distribution.

