A VAE FOR TRANSFORMERS WITH NONPARAMETRIC VARIATIONAL INFORMATION BOTTLENECK

Abstract

We propose a Variational AutoEncoder (VAE) for Transformers by developing a Variational Information Bottleneck (VIB) regulariser for Transformer embeddings. We formalise such attention-based representations as mixture distributions, and use Bayesian nonparametrics to develop a Nonparametric VIB (NVIB) for them. The variable number of mixture components supported by nonparametrics captures the variable number of vectors supported by attention, and exchangeable distributions from nonparametrics capture the permutation invariance of attention. Our Transformer VAE (NVAE) uses NVIB to regularise the information passing from the Transformer encoder to the Transformer decoder. Evaluations of a NVAE, trained on natural language text, demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity.

Contributions. This paper makes the following contributions: (1) We propose a variational Bayesian framework for modelling attention-based representations using mixture distributions, denoising attention and Bayesian nonparametrics (Section 2). (2) We propose a nonparametric variational information bottleneck (NVIB) regulariser for learning attention-based representations (Section 3). (3) We propose a nonparametric variational autoencoder (NVAE), which is a variational Bayesian extension of a Transformer encoder-decoder (Section 4). (4) We show that the NVAE model is a competitive VAE which can reconstruct, generate, regularise its latent space and intuitively interpolate between sentences (Section 5).

1. INTRODUCTION

Attention-based deep learning models, such as Transformers (Vaswani et al., 2017; Devlin et al., 2019), have achieved unprecedented empirical success in a wide range of cognitive tasks, in particular in natural language processing. The use of attention allows these models to represent their input with multiple vectors, which is essential for embedding natural language text (Bahdanau et al., 2015). On the other hand, deep variational Bayesian approaches to representation learning, such as variational autoencoders (VAEs) (Kingma & Welling, 2014), have also been shown to have many benefits (Mathieu et al., 2019; Ghosh et al., 2020; Vahdat & Kautz, 2020), especially due to their variational information bottleneck (VIB) (Alemi et al., 2017) for regularising the induced latent representations. However, it has not been clear how to combine these two trends, because the latent space induced by Transformers is a set of vectors whose size grows with the size of the input, whereas standard VIB methods only apply to a vector space of a fixed size (Liu & Liu, 2019; Fang et al., 2021; Park & Lee, 2021). To define a VIB regulariser for a Transformer's embedding space, we need to allow the size of a latent representation to vary dynamically depending on the complexity of the individual input, and yet regularise the total amount of information conveyed by the whole representation.

In this paper, we propose such a variational information bottleneck for variable sized latent representations, which we use to regularise the embeddings of a Transformer encoder-decoder, giving us a variational autoencoder for Transformers. Like a Transformer encoder's embedding space, the proposed VAE's sampled encoder output is (a generalisation of) a set of vectors, and the decoder accesses this embedding with (a generalisation of) cross attention.
But unlike Transformers, the proposed VIB layer for this VAE regularises the (effective) number of vectors in the set, as well as the information conveyed by each vector. We show that this regularisation improves generative abilities and compresses latent representations. In addition to the regularisation of over-parameterised language models (Child et al., 2019), previous work shows the efficacy of VAEs for: disentanglement (Higgins et al., 2017), language generation (Liu & Liu, 2019), and explainability (Mercatali & Freitas, 2021). All these topics are important and active areas of research in NLP.

To define this VIB, we need to model distributions over these variable-sized encoder embeddings, as interpreted by cross attention. Firstly, because the attention function returns an interpolation between the vectors output by the encoder, it generalises across the varying number of vectors, which like the input length is theoretically unbounded. Thus, to define distributions over these unbounded embeddings, we need to use nonparametric methods (Jordan, 2010). Secondly, the attention function is insensitive to the order of the vectors output by the encoder, so it interprets this embedding as a permutation-invariant set of vectors. Thus, the distributions over these permutation-invariant embeddings should be exchangeable (Jordan, 2010). Thirdly, the attention function imposes a normalised weighting over the embedding vectors, via the attention weights. So we should model an embedding as a distribution rather than a set. A normalised weighting over an unbounded permutation-invariant set of fixed-length vectors matches exactly the properties of a nonparametric space of mixture distributions, which have been extensively studied in Bayesian nonparametrics using exchangeable distributions (Blei & Jordan, 2006; Jordan, 2010).
In previous work, Bayesian nonparametrics is typically applied to learning models where the number of parameters grows with the size of the training data (Teh, 2010; Jordan, 2010; Kossen et al., 2021). In contrast, we apply it to inferring latent representations where the number of parameters grows with the size of the input. We believe this is the first work to use nonparametric methods in this way for deep variational Bayesian models.

To define a precise equivalence between attention-based representations and mixture distributions, we provide an interpretation of attention where the input set of vectors defines a mixture of impulse distributions, which is used as a prior to denoise the observed query vector (depicted in Figure 1b). Generalising sets of vectors to mixture distributions and generalising the attention function to query denoising allows us to propose a general deep variational Bayesian framework for attention-based models using Bayesian nonparametrics. More specifically, we propose to use Dirichlet processes (DPs) as the exchangeable distributions (Aldous, 1985; Jordan, 2010) to specify distributions over mixtures of impulse distributions, including distributions over the effective number of components in the mixture.

We define a nonparametric VIB (NVIB) layer using a bounded DP prior and posterior to regularise the effective size of variable-sized latent representations. This NVIB layer uses exact inference to infer the posterior from a set of pseudo-observations, and uses proposed efficient approximations to sample from this posterior with a reparameterisation trick and to regularise it with the KL divergence with the prior. Applying this NVIB regulariser to a Transformer autoencoder gives us our proposed nonparametric variational autoencoder (NVAE), depicted in Figure 1a.
The noise introduced by sampling from the DP posterior controls the amount of information which flows from the encoder to the decoder, despite the fact that the amount of information required to reconstruct different text inputs varies enormously.

To evaluate the effectiveness of NVIB, we train a NVAE on natural language text and find that it is able to reconstruct, generate and regularise the effective number of vectors in the latent representation, thereby demonstrating that NVAE is a viable VAE. We also find that the regularised latent space is smooth, using a proposed method for interpolating between DP posteriors to generate interpolations between sentences.

Related work. Related work on stochastic attention treats the keys, queries, values (Martin et al., 2020) or attention weight vectors of the network as latent random variables (Deng et al., 2018; Bahuleyan et al., 2018; Fan et al., 2020; Cinquin et al., 2022). Nguyen et al. (2022) provide a formulation and interpretation of attention keys as latent mixture distributions, whereas our formulation characterises the whole attention function and is interpreted as Bayesian query denoising. The use of Bayesian nonparametrics to learn a variable sized latent space using a VAE (Nalisnick & Smyth, 2017; Goyal et al., 2017; Echraibi et al., 2020) still assumes a fixed-sized latent representation at test time, unlike our proposal.

2. A NONPARAMETRIC BAYESIAN FRAMEWORK FOR TRANSFORMER EMBEDDINGS

This section proposes a formalisation of attention-based representations as mixture distributions over a vector space, and proposes nonparametric Bayesian methods for modelling information about these mixture distributions. First we show that standard attention functions can be interpreted as implementing Bayesian query denoising, where the set of vectors being accessed specifies a mixture of impulse distributions (Section 2.1). We adopt mixture distributions over vectors as a generalisation of attention-based representations, and adopt this denoising function as a generalisation of attention. Then we use Bayesian nonparametrics to propose prior (Section 2.2) and posterior distributions (Section 2.3) over these mixture distributions. These priors and posteriors form the basis of our nonparametric variational information bottleneck, proposed in Section 3.

2.1. DENOISING ATTENTION

The attention function provides access to a set of vectors by mapping a query vector to the resulting attention vector. As the basis of our approach to attention-based representations, we generalise the set of vectors to a probability distribution over vectors, and generalise attention to a function of these probability distributions.

The attention mechanism we assume is scaled dot product attention, standardly used in many attention-based models including Transformers. For simplicity, we consider cross attention, where a single query vector is mapped to a single result vector. This attention function projects the input vector $u' \in \mathbb{R}^{1 \times p}$ via the weight matrix $W^Q \in \mathbb{R}^{p \times d}$ to a query, and projects the set of vectors $Z \in \mathbb{R}^{n \times p}$ via weight matrices $W^K, W^V \in \mathbb{R}^{p \times d}$ to keys and values, respectively. It uses the keys' dimensionality $d$ for scaling. We regroup this scaled dot product attention function into a core dot product attention function $\mathrm{Attn}(u,Z)$ in which all operations are done in the space of $Z$:

$$\mathrm{Attention}(u', Z;\, W^Q, W^K, W^V) = \mathrm{Attn}\big(u' W^Q (W^K)^T,\, Z\big)\, W^V = \mathrm{Attn}(u, Z)\, W^V$$

where $u = u' W^Q (W^K)^T \in \mathbb{R}^{1 \times p}$. The function $\mathrm{Attn}(u,Z)$ can then be defined in two equivalent ways (as shown in Appendix G): in terms of a sum over the vectors $z_i$ in $Z$, or in terms of an integral over a distribution which is only nonzero at the $z_i$:

$$\mathrm{Attn}(u, Z) = \mathrm{softmax}\Big(\tfrac{1}{\sqrt{d}}\, u Z^T\Big) Z = \mathrm{DAttn}(u;\, F_Z)$$

$$F_Z = \sum_{i=1}^{n} \frac{\exp\big(\tfrac{1}{2\sqrt{d}} \|z_i\|^2\big)}{\sum_{i=1}^{n} \exp\big(\tfrac{1}{2\sqrt{d}} \|z_i\|^2\big)}\, \delta_{z_i} \tag{2}$$

$$\mathrm{DAttn}(u;\, F) = \int_v \frac{f(v)\, g(u;\, v, \sqrt{d} I)}{\int_v f(v)\, g(u;\, v, \sqrt{d} I)\, dv}\; v\; dv \tag{3}$$

where $\delta_{z_i}$ is an impulse distribution at $z_i$, $f(\cdot)$ is the probability density function for distribution $F$, and $g(u;\, v, \sqrt{d} I)$ is the multivariate Gaussian function with diagonal variance of $\sqrt{d}$. As depicted in Figure 1b, $\mathrm{DAttn}(u;\, F_Z)$ can be interpreted as query denoising.
The query $u$ is interpreted as an observation of some true vector $v$ which has been corrupted by Gaussian noise, where $v$ was generated from a prior probability distribution $F_Z$ specified by $Z$. The result of $\mathrm{Attn}(u,Z)$ is the expected value of this true vector $v$ after seeing the noisy observation $u$, which can be interpreted as a form of denoising.

This denoising attention function $\mathrm{DAttn}(u;\, F)$ is actually a generalisation of attention over a set of vectors, in that it is defined for any probability distribution $F$ over a vector space. In the special case where $F = \sum_i \pi_i \delta_{z_i}$ is a finite mixture of impulse distributions, it is the same as $\mathrm{Attn}(u,Z)$ but with a bias term $\log(\pi_i)$ substituted for $\tfrac{1}{2\sqrt{d}} \|z_i\|^2$.
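The equivalence between dot product attention and denoising attention over a mixture of impulses can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `attn`, `dattn_impulses` and `f_z_log_weights` are hypothetical, and the query and vectors are small hand-picked values.

```python
import math

def _weighted_mean(log_w, Z):
    """Softmax-normalise the log weights and return the weighted mean of the rows of Z."""
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [sum(wi * z[k] for wi, z in zip(w, Z)) / total for k in range(len(Z[0]))]

def attn(u, Z, d):
    """Core dot product attention: softmax(u Z^T / sqrt(d)) Z."""
    scores = [sum(a * b for a, b in zip(u, z)) / math.sqrt(d) for z in Z]
    return _weighted_mean(scores, Z)

def dattn_impulses(u, Z, log_pi, d):
    """Denoising attention for F = sum_i pi_i delta_{z_i}.

    Returns the posterior mean of v given the noisy query u, where the noise
    is Gaussian with variance sqrt(d) in every dimension. log_pi may be
    unnormalised because the weights are renormalised here.
    """
    var = math.sqrt(d)
    log_w = [lp - sum((a - b) ** 2 for a, b in zip(u, z)) / (2.0 * var)
             for lp, z in zip(log_pi, Z)]
    return _weighted_mean(log_w, Z)

def f_z_log_weights(Z, d):
    """Unnormalised log weights of F_Z in equation 2: ||z_i||^2 / (2 sqrt(d))."""
    return [sum(x * x for x in z) / (2.0 * math.sqrt(d)) for z in Z]

# Illustrative values only: p = 3, n = 3, d = 4.
u = [0.5, -1.0, 2.0]
Z = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [-1.0, 1.0, 0.5]]
d = 4
standard = attn(u, Z, d)
denoised = dattn_impulses(u, Z, f_z_log_weights(Z, d), d)
print(standard, denoised)
```

The two results agree because the squared-norm weights of $F_Z$ cancel the $\|z_i\|^2$ term of the Gaussian exponent, leaving only the scaled dot product. Passing other log weights to `dattn_impulses` gives the general finite-mixture case.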

2.2. A PRIOR OVER MIXTURE DISTRIBUTIONS

Given that our attention-based latent representations are formalised as mixture distributions $F$, a Bayesian approach requires a prior over these distributions. Attention-based models place no finite bound on the possible number of vectors in their set of vectors $Z$, and thus there is no finite bound on the number of parameters needed to specify the equivalent mixture distribution $F$. Nonetheless, we can still specify probability distributions over this infinite space of possible distributions $F$, using methods from Bayesian nonparametrics. These nonparametric Bayesian methods, with exchangeable distributions, are specifically designed for modelling probability distributions over unboundedly large mixture distributions.

We base our distributions over mixture distributions on Dirichlet processes $\mathrm{DP}(G_0, \alpha_0)$. Dirichlet processes (DPs) are a generalisation of Dirichlet distributions to an infinite support, such as the points in a vector space. A Dirichlet distribution $\mathrm{Dir}(\alpha)$ is a distribution over probability mass functions $\pi$ of discrete categories $i$, $1 \le i \le \kappa$. One useful definition of Dirichlet processes views a DP $F \sim \mathrm{DP}(G_0, \alpha_0)$, where $G_0$ is the base distribution over vectors and $\alpha_0 \in \mathbb{R}$ is the concentration parameter, as the limit of a sequence of finite Dirichlet distributions (see Teh (2010)), given in equation 4. Note that the Dirichlet distributions in equation 4 are symmetric, in that all the $\kappa$ categories $i$ have the same $\alpha_i = \frac{\alpha_0}{\kappa}$ parameter values. However, these categories end up with very different weights $\pi_i$, due to the most probable categories getting a large proportion of the probability mass and the tail of categories getting an exponentially decreasing amount of probability mass. In the infinite limit, this tail is infinitely long with infinitesimal probabilities. The number of categories which get nontrivial probabilities is determined by $\alpha_0$, and becomes independent of $\kappa$ as $\kappa$ gets large.
$$F = \sum_{i=1}^{\infty} \pi_i \delta_{z_i}, \qquad \pi \sim \lim_{\kappa \to \infty} \mathrm{Dir}\big(\tfrac{\alpha_0}{\kappa}, \overset{\kappa}{\ldots}, \tfrac{\alpha_0}{\kappa}\big), \qquad z_i \sim G_0 \ \text{ for } i = 1, \ldots, \infty \tag{4}$$

As shown in this definition, each sample $F$ from a DP is an infinite mixture of impulse distributions $\delta_{z_i}$, parameterised by an infinite sequence of weight-vector pairs $\pi_i, z_i$. This contrasts with the finite $Z$ in attention-based representations. Having an infinite $F$ would also cause problems in our variational Bayesian model, because VIB uses a bound on the log-likelihood (see Section 3.1), which Kingma & Welling (2014) showed has an error of $D_{KL}(q(F|x) \,\|\, p(F|x))$ (the looseness of the bound). This would be infinite unless both the true posterior $p(F|x)$ and its approximation $q(F|x)$ generate a finite $F$, so we need a prior which generates finite $F$.

The Unbounded Dirichlet Process Prior. We do not want a prior which places an a priori bound on the size of $F$, so we assume it is finite but unbounded, and propose a prior which is an unbounded sequence of finite approximations to a DP. We define a bounded DP $F \sim \mathrm{BDP}(G_0, \alpha_0, \kappa_0)$ as in equation 5.

$$F = \sum_{i=1}^{\kappa_0} \pi_i \delta_{z_i}, \qquad \pi \sim \mathrm{Dir}\big(\tfrac{\alpha_0}{\kappa_0}, \overset{\kappa_0}{\ldots}, \tfrac{\alpha_0}{\kappa_0}\big), \qquad z_i \sim G_0 \ \text{ for } i = 1, \ldots, \kappa_0 \tag{5}$$

Our approach to the prior is to use an unbounded but finite $\kappa_0$, so we define a distribution over approximations as $\kappa_0$ increases towards infinity. Hence, every distribution is over a finite number of vectors, but there is no finite bound on the number of vectors in all distributions. Given $\phi$ is some distribution over positive integers $\kappa \in \mathbb{Z}^+$, we define this unbounded DP as $\mathrm{UDP}(G_0, \alpha_0, \phi) = \mathrm{BDP}(G_0, \alpha_0, \kappa)$ where $\kappa \sim \phi$. We use these definitions both to define a general prior over probability distributions, and to define a conditional prior for each input length.
In both cases, the base distribution $G^p_0$ is assumed to be a unit Gaussian (inspired by Kingma & Welling (2014)) and the concentration parameter $\alpha^p_0$ is assumed to be one.

$$G^p_0 = \mathcal{N}\big(\mu^p, I(\sigma^p)^2\big); \qquad \alpha^p_0 = 1; \qquad \mu^p = 0; \qquad \sigma^p = 1$$

The general prior is $\mathrm{UDP}(G^p_0, \alpha^p_0, \phi^p)$, where the size distribution $\phi^p$ is determined empirically. The conditional prior $\mathrm{BDP}(G^p_0, \alpha^p_0, \kappa_0)$ sets the level of approximation $\kappa_0$ as a fixed function of the input length $n$, in particular $\kappa_0 = (n+1)\kappa_\Delta$, where $\kappa_\Delta \in \mathbb{Z}^+$ is a hyperparameter that controls the approximation.

A Conditional Bounded DP Prior. It will be useful to generalise this conditioning for the level of approximation to any conditional prior which is a fixed function of only the input length. If we know the input length $n$, but know nothing about the content of the text, then the distribution of vectors should stay the same as the general prior, $G^{p'}_0 = G^p_0$. However, the count of observations we expect to have after an input of that length would not be $\alpha^p_0$, but should include a pseudo-count $\alpha_\Delta \in \mathbb{R}_{\ge 0}$ hyperparameter for every token, and thus $\alpha^{p'}_0 = \alpha^p_0 + n\alpha_\Delta$. This then gives us the conditional prior given $n$ of $\mathrm{BDP}(G^p_0, \alpha^{p'}_0, \kappa_0)$.
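A draw from the conditional bounded DP prior can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not the authors' code: the helper names are hypothetical, and the values chosen for `alpha_delta`, `kappa_delta` and the vector size `p` are assumptions made for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Sample Dirichlet weights by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def conditional_prior_params(n, alpha_p0=1.0, alpha_delta=0.5, kappa_delta=1):
    """Concentration and bound of the conditional prior for an input of n tokens."""
    kappa_0 = (n + 1) * kappa_delta          # kappa_0 = (n + 1) * kappa_Delta
    alpha_cond = alpha_p0 + n * alpha_delta  # alpha_0^{p'} = alpha_0^p + n * alpha_Delta
    return alpha_cond, kappa_0

def sample_bdp(alpha_0, kappa_0, p, rng, mu=0.0, sigma=1.0):
    """Sample F ~ BDP(G_0, alpha_0, kappa_0) as in equation 5, with G_0 = N(mu, sigma^2 I)."""
    pi = sample_dirichlet([alpha_0 / kappa_0] * kappa_0, rng)
    Z = [[rng.gauss(mu, sigma) for _ in range(p)] for _ in range(kappa_0)]
    return pi, Z

rng = random.Random(0)
alpha_cond, kappa_0 = conditional_prior_params(n=4)
pi, Z = sample_bdp(alpha_cond, kappa_0, p=3, rng=rng)
print(alpha_cond, kappa_0, pi)
```

Each sample is a finite list of weights `pi` and vectors `Z`, so it can be passed directly to a denoising attention function over impulse mixtures. Very small per-category concentrations can make individual Gamma draws underflow to zero, which a production implementation would need to guard against.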

2.3. A POSTERIOR OVER MIXTURE DISTRIBUTIONS

Since a DP is a conjugate prior, we can use exact inference to compute the posterior DP from the prior DP plus a set of pseudo-observations output by the encoder. Each pseudo-observation is a real-valued pseudo-count $\alpha^q_i \in \mathbb{R}_{\ge 0}$ and a parametric distribution which represents uncertainty in the observation. We use an isotropic Gaussian, $G^q_i = \mathcal{N}\big(\mu^q_i, I(\sigma^q_i)^2\big)$, as the parametric distribution, specified by a mean $\mu^q_i \in \mathbb{R}^{1 \times d}$ and a standard deviation $\sigma^q_i \in \mathbb{R}^{1 \times d}_{>0}$. Here we assume that the number of candidate pseudo-observations is the same as the length $n$ of the input, but some of these pseudo-observations may have zero pseudo-counts and thus be effectively removed from the set. The formula for exact inference of the posterior DP is given in equation 6, where there is an $(n+1)$-th component of the base distribution $G^q_0$ which comes from the prior, namely $\alpha^q_{n+1} = \alpha^p_0$ and $G^q_{n+1} = G^p_0$.

$$F \sim \mathrm{DP}(G^q_0, \alpha^q_0), \qquad \alpha^q_0 = \sum_{i=1}^{n+1} \alpha^q_i; \qquad G^q_0 = \sum_{i=1}^{n+1} \frac{\alpha^q_i}{\alpha^q_0}\, G^q_i \tag{6}$$

We derive an alternative factorisation of the posterior DP (Appendix H) which helps with the sampling method in Section 3.2. We then bound this factorised DP so that it generates the same number $\kappa_0 = (n+1)\kappa_\Delta$ of weighted vectors as the prior $\mathrm{BDP}(G^p_0, \alpha^p_0, \kappa_0)$. The resulting bounded posterior $F \sim \mathrm{BFDP}(G^q, \alpha^q, \kappa_\Delta)$ is given in equation 7, which defines our posterior distribution $q(F|x)$.

$$F = \sum_{i=1}^{n+1} \rho_i F_i, \qquad \rho \sim \mathrm{Dir}(\alpha^q_1, \ldots, \alpha^q_{n+1}), \qquad F_i \sim \mathrm{BDP}(G^q_i, \alpha^q_i, \kappa_\Delta) \ \text{ for } i = 1, \ldots, n+1 \tag{7}$$

This posterior simplifies to a mixture $F = \sum_{i=1}^{n+1} \sum_{j=1}^{\kappa_\Delta} \rho_i \pi'_{ij} \delta_{z_{ij}}$ of impulse distributions, with $\rho \sim \mathrm{Dir}(\alpha^q_1, \ldots, \alpha^q_{n+1})$, $\pi'_i \sim \mathrm{Dir}\big(\tfrac{\alpha^q_i}{\kappa_\Delta}, \overset{\kappa_\Delta}{\ldots}, \tfrac{\alpha^q_i}{\kappa_\Delta}\big)$, and $z_{ij} \sim G^q_i$.

The Mean Posterior Mixture Distribution. A VAE is trained on samples from the posterior, but at test time VAEs typically use the mean of this distribution.
Generalising the latent space to mixture distributions makes this straightforward, since the mean of our BFDP posterior is its base distribution $G^q_0$. This base distribution is a continuous distribution, whereas at training time all samples are discrete distributions. Nonetheless, when accessed via denoising attention, the base distribution looks like a typical sample from the posterior. This is visualised in Figure 1 by comparing the vector returned by denoising attention (in green) given the continuous mean distribution (Figure 1c) and a typical sample from this distribution (Figure 1b). Thus, the function defined by applying denoising attention to a sampled distribution can be seen as a noisy version of the function defined by applying denoising attention to the mean distribution. For our Gaussian mixture base distribution $G^q_0$, there is a closed-form solution to computing denoising attention, given in Appendix F.
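To make the test-time computation concrete, the sketch below builds the posterior base distribution of equation 6 and applies denoising attention to it. It is an assumption-laden illustration rather than the paper's Appendix F: the closed form used is the standard Gaussian conjugacy result for a Gaussian-mixture prior observed under Gaussian noise of variance $\sqrt{d}$, the prior component is taken to be the unit Gaussian with $\alpha^p_0 = 1$, and the function names are hypothetical.

```python
import math

def posterior_base(alpha_q, mu_q, sigma_q, alpha_p0=1.0):
    """Exact-inference parameters of equation 6, with the prior as component n+1."""
    p = len(mu_q[0])
    alphas = list(alpha_q) + [alpha_p0]
    mus = [list(m) for m in mu_q] + [[0.0] * p]        # prior mean mu^p = 0
    sigmas = [list(s) for s in sigma_q] + [[1.0] * p]  # prior std sigma^p = 1
    alpha_q0 = sum(alphas)
    weights = [a / alpha_q0 for a in alphas]
    return alpha_q0, weights, mus, sigmas

def dattn_gaussian_mixture(u, weights, mus, sigmas, d):
    """Posterior mean of v given query u, for a diagonal Gaussian-mixture prior."""
    noise_var = math.sqrt(d)
    log_resp, cond_means = [], []
    for w, mu, sig in zip(weights, mus, sigmas):
        if w <= 0.0:
            continue  # zero pseudo-count: the component is effectively masked
        log_r = math.log(w)
        mean = []
        for u_k, m_k, s_k in zip(u, mu, sig):
            total_var = s_k * s_k + noise_var
            # Marginal likelihood of u_k under this component.
            log_r += (-0.5 * math.log(2.0 * math.pi * total_var)
                      - (u_k - m_k) ** 2 / (2.0 * total_var))
            # Component-wise posterior mean of v_k given u_k.
            mean.append((noise_var * m_k + s_k * s_k * u_k) / total_var)
        log_resp.append(log_r)
        cond_means.append(mean)
    top = max(log_resp)
    resp = [math.exp(r - top) for r in log_resp]
    norm = sum(resp)
    return [sum(r * m[k] for r, m in zip(resp, cond_means)) / norm
            for k in range(len(u))]

# Illustrative values: n = 2 tokens, the second with a zero pseudo-count.
alpha_q0, weights, mus, sigmas = posterior_base(
    alpha_q=[2.0, 0.0],
    mu_q=[[1.0, -1.0], [3.0, 3.0]],
    sigma_q=[[0.5, 0.5], [1.0, 1.0]])
result = dattn_gaussian_mixture([0.8, -0.6], weights, mus, sigmas, d=4)
print(alpha_q0, weights, result)
```

As the component standard deviations shrink towards zero, each component behaves like an impulse at its mean, which is consistent with the mean distribution looking like a typical sample when accessed through denoising attention.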

3. THE NONPARAMETRIC VARIATIONAL INFORMATION BOTTLENECK

By generalising attention-based representations to mixture distributions and generalising the attention function to denoising attention, we can define a VIB regulariser for attention-based interfaces. Such encoder-decoder interfaces take a set-of-vectors representation and return a function from query to result vectors. We map the input set-of-vectors to a set of pseudo-observations, and define the returned function with denoising attention. Then, given the nonparametric prior and posterior from Section 2, we can define our nonparametric VIB regulariser by specifying how to compute the KL divergence between the prior and posterior, and how to effectively sample from the posterior for training. As far as we are aware, this proposal is the first VIB model for attention-based representations like Transformer embeddings.

The VIB layer in a VAE controls the amount of information passing through it by introducing noise according to a posterior output by the encoder, and regularises this information by minimising the KL divergence between this posterior and an uninformative prior. One of the known difficulties with VAEs is that they can be difficult to train due to the posterior collapsing to the prior (Bowman et al., 2016). Similarly to the free-bits objective proposed to address this problem for vector-space VAEs (Kingma et al., 2016), instead of regularising towards the prior, which gets no pseudo-counts from the input, we regularise towards the conditional prior $\mathrm{BDP}(G^p_0, \alpha^{p'}_0, \kappa_0)$, which gets $n\alpha_\Delta$ pseudo-counts from the input but knows nothing about the information they convey. We found that this helps with the stability of training and avoiding posterior collapse.

3.1. THE VARIATIONAL INFORMATION BOTTLENECK LOSS

The evidence lower bound (ELBO) is commonly used in variational Bayesian methods as an objective which approximately maximises the log-likelihood of the observation $x$, where $x$ is the input text.

$$\log p(x) \ \ge\ \mathbb{E}_{q(F|x)} \log p(x|F) \ -\ D_{KL}\big(q(F|x) \,\|\, p(F)\big) \tag{8}$$

$$L_R = -\mathbb{E}_{q(F|x)} \log p(x|F)$$

The first term of the bound is the negative of the reconstruction loss $L_R$, computed using samples $F$ from the approximate posterior $q(F|x)$, and the second term is the KL divergence between this posterior and the prior $p(F)$.

For the KL divergence term, since both the prior and the posterior are conditioned on the same bound $\kappa_0$ on the number of vectors they generate, we can compute a meaningful finite KL divergence between the conditional prior $p(F) = \mathrm{BDP}(G^p_0, \alpha^{p'}_0, \kappa_0)$ and the posterior $q(F|x) = \mathrm{BFDP}(G^q, \alpha^q, \kappa_\Delta)$, given $\kappa_\Delta$ and $\kappa_0 = (n+1)\kappa_\Delta$. The derivation of the KL divergence is given in Appendix I. It has two terms, one $L_D$ for the distribution of weights $\pi$ generated by the Dirichlet distributions, and one $L_G$ for the distribution of vectors $Z$ generated by the component Gaussians.

Using the exact KL divergence for our bounded DPs would regularise each component equally, even for components which have zero $\alpha^q_i$ and thus have no impact on the posterior. We instead approximate a KL divergence where only samples with a nontrivial weight are regularised. Marginalising over the number of nontrivial weights for each component does not appear to be tractable for $L_D$, but since the relationship is approximately linear (see Appendix I), we simply substitute the expected number $\frac{\alpha^q_i}{\alpha^q_0}\kappa_0$ for the actual number in the equation for KL divergence. This gives us the following loss terms for the KL divergence, where $\Gamma$ is the gamma function and $\psi$ is the digamma function.
$$L_D + L_G \approx D_{KL}\big(q(F|x) \,\|\, p(F)\big)$$

$$L_D = \log\Gamma(\alpha^q_0) - \log\Gamma(\alpha^{p'}_0) + (\alpha^q_0 - \alpha^{p'}_0)\Big(-\psi(\alpha^q_0) + \psi\big(\tfrac{\alpha^q_0}{\kappa_0}\big)\Big) + \kappa_0\Big(\log\Gamma\big(\tfrac{\alpha^{p'}_0}{\kappa_0}\big) - \log\Gamma\big(\tfrac{\alpha^q_0}{\kappa_0}\big)\Big)$$

$$L_G = \frac{1}{2}\,\kappa_0 \sum_{i=1}^{n+1} \frac{\alpha^q_i}{\alpha^q_0} \sum_{h=1}^{d} \left( \frac{(\mu^q_{ih} - \mu^p_h)^2}{(\sigma^p_h)^2} + \frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} - 1 - \log\frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} \right)$$

Both $L_D$ and $L_G$ scale approximately linearly with $\kappa_0$.

To generalise the ELBO beyond autoencoders, it can be viewed as a way to regularise the amount of information which passes through the latent representation (Alemi et al., 2017). This VIB interpretation allows the different parts of the objective to have different weights. We introduce two hyperparameters to control the relative weight of the above three parts of the ELBO, which defines our VIB loss $L$.

$$L = L_R + \lambda_D L_D + \lambda_G L_G \tag{10}$$

3.2. SAMPLING A MIXTURE DISTRIBUTION FROM THE POSTERIOR

To control the amount of information which passes from the encoder to the decoder, at training time a VAE (Kingma & Welling, 2014) samples from the encoder's posterior distribution and uses this sample to reconstruct the input. The "reparameterisation trick" is used to ensure that backpropagation of the reconstruction error through this sampling step can be done effectively. We propose a novel reparameterisation trick for bounded Dirichlet processes which allows sampling without any categorical choices, and propose specific sampling methods which result in effective backpropagation through the sampling step. For our NVIB model, we sample the parameters $\langle \pi, Z \rangle$ of a mixture distribution $F$ generated by our bounded Dirichlet process posterior $\mathrm{BFDP}(G^q, \alpha^q, \kappa_\Delta)$, where $F$ consists of a set of impulse distributions $\delta_{z_i}$ each with a weight $\pi_i$.
A straightforward approach to sampling from a Dirichlet process would independently sample weights π from a (theoretically infinite) Dirichlet distribution and sample vectors Z from the base distribution of the DP, where sampling from the base distribution involves first sampling a component of the base distribution and then sampling a vector from that component's Gaussian.

A Factorised Sampling Method

The problem is that sampling a component is a discrete choice, for which there is no exact reparameterisation trick. Instead, we note that the components do not differ in the number of vectors sampled from each one (always theoretically infinite for a DP), but only differ in the distribution of weights for those vectors. As specified in the factorised DP in Section 2.3, we characterise this distribution over weights by factorising it into two steps: first choosing how the total weight is distributed across components ($\rho$), and then for each component choosing how its weight is distributed across its vectors ($\pi'_i$). These are both continuous choices. The vectors can then be sampled independently from each component.

Reparameterisation tricks. Each individual component specifies a Gaussian distribution over vectors, so we can use the same reparameterisation trick as Kingma & Welling (2014) for sampling vectors $Z_i$ from an individual component $i$. The factorised and bounded nature of our posterior $\mathrm{BFDP}(G^q, \alpha^q, \kappa_\Delta)$ means that the total weights $\rho$ and the individual weights $\pi'_i$ are all sampled from Dirichlet distributions. A Dirichlet distribution over a set of category weights can be sampled by sampling from a Gamma distribution for each category and then normalising. There is no closed-form, explicit reparameterisation trick for the exact Gamma distribution, but there are for approximations. Knowles (2015) proposes two such approximations which we combine, one which is more accurate for small $\alpha$ and one which is more accurate for larger $\alpha$. We leave the investigation of implicit reparameterisation gradients (Figurnov et al., 2018), an alternative approach to explicit reparameterisation, to future work. More specifics about the sampling methods and their reparameterisation tricks are given in Appendix J. We provide an evaluation (Appendix B.3) showing that $\kappa_\Delta = 1$ is an efficient and effective sampling method.
In this case, there is no need to sample from the Dirichlet for each individual component, but we still sample the weights ρ across components.
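The pieces of Section 3 can be put together in a short sketch of the NVIB quantities for the $\kappa_\Delta = 1$ case: the two KL terms $L_D$ and $L_G$ and a sample of $\langle \rho, Z \rangle$ from the posterior. This is a minimal illustration in standard-library Python under stated assumptions, not the authors' implementation. In particular, the exact Gamma sampler `random.gammavariate` stands in for the reparameterisable approximations of Knowles (2015), so it is not differentiable, and the helper names, the `digamma` approximation and the numeric values are all hypothetical.

```python
import math
import random

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))

def loss_d(alpha_q0, alpha_p0_cond, kappa_0):
    """L_D: KL term for the Dirichlet weights, against the conditional prior."""
    return (math.lgamma(alpha_q0) - math.lgamma(alpha_p0_cond)
            + (alpha_q0 - alpha_p0_cond)
            * (-digamma(alpha_q0) + digamma(alpha_q0 / kappa_0))
            + kappa_0 * (math.lgamma(alpha_p0_cond / kappa_0)
                         - math.lgamma(alpha_q0 / kappa_0)))

def loss_g(alphas, mus, sigmas, kappa_0, mu_p=0.0, sigma_p=1.0):
    """L_G: Gaussian KL per component, weighted by alpha_i / alpha_0 and scaled by kappa_0."""
    alpha_q0 = sum(alphas)
    total = 0.0
    for a, mu, sig in zip(alphas, mus, sigmas):
        per_dim = sum((m - mu_p) ** 2 / sigma_p ** 2 + s * s / sigma_p ** 2
                      - 1.0 - math.log(s * s / sigma_p ** 2)
                      for m, s in zip(mu, sig))
        total += (a / alpha_q0) * per_dim
    return 0.5 * kappa_0 * total

def sample_posterior(alphas, mus, sigmas, rng):
    """Sample rho ~ Dir(alphas) and one vector per component (kappa_Delta = 1)."""
    # Components with a zero pseudo-count get zero weight (effectively masked).
    gammas = [rng.gammavariate(a, 1.0) if a > 0.0 else 0.0 for a in alphas]
    total = sum(gammas)
    rho = [g / total for g in gammas]
    # Gaussian reparameterisation: z = mu + sigma * eps with eps ~ N(0, 1).
    Z = [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sig)]
         for mu, sig in zip(mus, sigmas)]
    return rho, Z

# Illustrative values: n = 2 tokens plus the prior component, alpha_Delta = 0.5.
alphas = [1.5, 0.0, 1.0]
mus = [[0.3, -0.2], [1.0, 1.0], [0.0, 0.0]]
sigmas = [[0.5, 0.5], [1.0, 1.0], [1.0, 1.0]]
kappa_0 = 3
alpha_p0_cond = 1.0 + 2 * 0.5
rho, Z = sample_posterior(alphas, mus, sigmas, random.Random(0))
kl_d = loss_d(sum(alphas), alpha_p0_cond, kappa_0)
kl_g = loss_g(alphas, mus, sigmas, kappa_0)
print(rho, kl_d, kl_g)
```

In a full training step these two terms would be combined with the reconstruction loss as $L_R + \lambda_D L_D + \lambda_G L_G$, with the decoder attending to the sampled `rho` and `Z` through denoising attention.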

4. THE NONPARAMETRIC VARIATIONAL AUTOENCODER

We define a VAE for Transformers by using the nonparametric VIB defined in Section 3 to regularise the attention-based representation between the encoder and decoder of a Transformer autoencoder, as depicted in Figure 1a. In this NVAE model, the Transformer encoder is used to estimate the parameters $\langle \alpha^q, \mu^q, \sigma^q \rangle$ of the posterior given the input text $x$. The Transformer decoder is used to reconstruct the input text $x$ using denoising attention over a sample $F$ from this posterior.

The NVIB Regulariser. Our NVIB layer regularises the amount of information which passes from the encoder to the decoder through this posterior. As with VIB for vector spaces, the KL divergence encourages the encoder to output component Gaussians with smaller $\mu^q_i$ and larger $\sigma^q_i$. With NVIB, the KL divergence also encourages the encoder to output smaller and sparser $\alpha^q_i$, which regularises the effective number of components in the posterior as well as the noisiness of their weights.

The Transformer Encoder. The Transformer encoder $q(F|x)$, given a text $x$ of $n$ tokens, is used to compute a vector for each token $i$ of the input. From each of these $n$ individual token embeddings, the encoder then linearly projects to three parameters, $\alpha^q_i \in \mathbb{R}$, $\mu^q_i \in \mathbb{R}^{1 \times p}$ and $\log(\sigma^q_i) \in \mathbb{R}^{1 \times p}$. The variance parameters are exponentiated to be strictly positive, whereas the pseudo-count parameters $\alpha^q_i$ are estimated using a Rectified Linear Unit (ReLU) activation (Nair & Hinton, 2010), which results in masking the vector during cross-attention when it is exactly zero. Thus, the DP posterior has one component $\langle \alpha^q_i, \mu^q_i, \sigma^q_i \rangle$ of its base distribution for each token of the input (plus one for the prior).

The Transformer Decoder. The Transformer decoder $p(x|F)$ receives a distribution $F$ over vectors and reconstructs the input text $x$.
During training, $F$ is specified by the sampled vectors $Z \in \mathbb{R}^{\kappa_0 \times p}$ and the sampled weights $\pi \in \mathbb{R}^{\kappa_0 \times 1}$, and at test time $F$ is specified by the output of the encoder $\alpha^q \in \mathbb{R}^{\kappa_0 \times 1}$, $\mu^q \in \mathbb{R}^{\kappa_0 \times p}$ and $\log(\sigma^q) \in \mathbb{R}^{\kappa_0 \times p}$. In both cases, the decoder accesses $F$ using denoising attention in the same way that standard Transformer decoders use cross attention. We include the exact equations used for a deep learning implementation of denoising attention in Appendix K. During training, the text is predicted using teacher forcing, and during test time the text is predicted autoregressively using greedy decoding until the end-of-sequence token is generated or the generated sentence is 50 tokens longer than the target length.

The Generative Model. To use our NVAE model as a generative model, we sample from the prior and use the trained Transformer decoder to generate a sentence. As discussed in Section 2.2, to sample from the same prior as used for training, we need to first sample a sentence length, and then sample from the conditional prior given that (approximate) sentence length. For simplicity, we sample the sentence length from the empirical distribution of sentence lengths in the training data.

5. INTRINSIC EVALUATIONS OF NVIB IN NVAE

To support our theoretical contributions, we provide proof-of-concept experiments which demonstrate that our proposed NVIB regulariser performs as claimed. We evaluate it in our proposed NVAE model by training NVAEs on natural language text and evaluating the resulting models. We show that the NVAE is a viable VAE model as it exhibits a competitive reconstruction versus generation trade-off (Section 5.1). We show that the NVIB layer is able to dynamically choose the number of components it needs in its embeddings (Section 5.2). Additionally, NVIB provides an intuitive way to interpolate between sentence embeddings, which provides an evaluation of the smoothness of the latent space (Section 5.3).

Data. The Wikitext-2 and Wikitext-103 (Merity et al., 2017) encyclopedia datasets were selected as they are general English language corpora, of a small and a large scale respectively, containing high quality Wikipedia articles.

Baselines. We compare to various alternative ways to define a VAE from a Transformer autoencoder. As representative of a standard fixed-length-vector VAE, the Variational Transformer Pooled (VTP) baseline pools its vectors across the sequence length dimension, and then applies a Gaussian VIB layer (Kingma & Welling, 2014). At the other extreme, the Variational Transformer (VT) baseline keeps all its vectors and applies a Gaussian VIB layer to each one. In between these baselines, as a hand-coded solution to constraining the quantity of latent vectors, the Variational Transformer Stride (VTS) baselines, with parameter S, mask a proportion 1-S of the embedding vectors based on their position. For comparability, all our baselines only differ from the NVAE model in the latent representation between the encoder and decoder, with the same Transformer encoder and Transformer decoder architectures.

5.1. RECONSTRUCTION VERSUS GENERATION

This section shows that the NVAE Transformer model is a competitive VAE in both reconstruction of input sentences and generation by sampling from the prior. All models undergo hyperparameter tuning on the validation set (Appendix B) across 5 seeds to select the best models, and we then report results on the Wikitext-2 test set. For reconstruction, we report the SacreBLEU metric (Papineni et al., 2002; Post, 2018). For generation, we report forward perplexity (F-PPL) and reverse perplexity (R-PPL) (Zhao et al., 2018; Cífka et al., 2018), which train an external language model on the gold training text and evaluate it on the generated text (F-PPL) or vice versa (R-PPL). Training details are provided in Appendix A.

Figures 2a and 2b show the reconstruction versus generation trade-off on the Wikitext-2 test set, where lower right is better. The single vector baseline VTP is unable to reconstruct well (low BLEU) or generate diverse sentences (high R-PPL). Even with larger capacity and more training data, the model performs poorly in generation (Appendix B.4). The full vector baseline VT is unable to consistently generate fluent sentences (high F-PPL), whereas the position-based dropout baseline VTS shows that regularisation of the space is beneficial for both reconstruction and generation quality. The best NVAE models are competitive with the best VTS baselines in being able to both reconstruct and generate, and are even slightly better at generation with high reconstruction accuracy. In particular, reverse perplexity shows that NVAEs are able to generate a diverse collection of sentences from the prior, and forward perplexity shows that these are fluent sentences (samples in Appendix C). We believe this advantage is the result of learning which vectors to keep instead of using a hand-coded position-based dropout.

5.2. REGULARISATION

This section shows that the NVIB layer is able to regularise the number of vectors in the latent representation of a NVAE. Without regularisation, the NVIB layer behaves like a standard Transformer, adding no noise and retaining all latent vectors (ablation in Appendix B.1). The proportion of vectors retained is controllable by the conditional prior hyperparameter $\alpha_\Delta$. Figure 2c shows that there exist values of $\alpha_\Delta$ for which the NVAE is able to remove a significant proportion $(1-\nu)$ of vectors whilst maintaining high reconstruction performance. Figure 2d plots the number of latent vectors retained during evaluation against the number of input tokens. The NVAE models are able to learn to dynamically regularise the number of vectors based on the information within the text, without hand-coding a function of length as in VTS. The different NVAE lines show the effect of $\alpha_\Delta$ on vector retention, with larger values allowing more vectors to be used (see also the evaluation in Appendix B.2).

5.3. INTERPOLATION

The NVIB framework provides an intuitive interpolation between latent sets of vectors that overcomes the challenges with the baselines of latent vector alignment (analysis in Appendix D) and variable set sizes. We simply interpolate the probability assigned to each point in vector space. Given two latent mixture distributions $F_1$ and $F_2$, we decode from the combined mixture distribution $\tau F_1 + (1-\tau) F_2$, for varying interpolation rates $0 \leq \tau \leq 1$. For the baselines we use $\tau Z_1 + (1-\tau) Z_2$ such that the interpolation is over the content of the latent vectors $Z_i$. We align latent sets of vectors by input position and pad smaller sets with zero vectors, which is the mean of the Gaussian prior. We evaluate the interpolation with selected larger scale models (see Appendix B.4) and use the Wikitext-103 validation set for $S_1$ and its reverse order for $S_2$. The results in Table 1 and Figure 3 show that the NVIB regulariser in NVAE provides smoother interpolations, and improvements in fluency of interpolations over the baselines.$^{3}$
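The difference between the two interpolation schemes can be made concrete with a small sketch. It assumes, for illustration only, that each latent mixture is a discrete set of weighted vectors (weights `pi`, vectors `Z`); the function names `interpolate_mixtures` and `interpolate_aligned` are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def interpolate_mixtures(pi1, Z1, pi2, Z2, tau):
    """Weights and vectors of the combined mixture tau*F1 + (1-tau)*F2.

    No alignment is needed: the components of both mixtures are kept and
    only their probability mass is rescaled.
    """
    weights = np.concatenate([tau * np.asarray(pi1), (1.0 - tau) * np.asarray(pi2)])
    vectors = np.concatenate([Z1, Z2], axis=0)
    return weights, vectors

def interpolate_aligned(Z1, Z2, tau):
    """Baseline: align by position, zero-pad the smaller set, interpolate contents."""
    n = max(len(Z1), len(Z2))
    pad = lambda Z: np.pad(Z, ((0, n - len(Z)), (0, 0)))
    return tau * pad(Z1) + (1.0 - tau) * pad(Z2)

rng = np.random.default_rng(0)
Z1, Z2 = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))  # sets of different sizes
pi1, pi2 = np.full(3, 1 / 3), np.full(5, 1 / 5)            # uniform weights (assumed)
weights, vectors = interpolate_mixtures(pi1, Z1, pi2, Z2, tau=0.25)
baseline = interpolate_aligned(Z1, Z2, tau=0.25)
```

The mixture interpolation keeps a valid probability distribution for every $\tau$ regardless of set sizes, whereas the baseline depends on the positional alignment and padding choice.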

6. CONCLUSIONS

In this work, we generalise the latent representations of attention-based models to mixture distributions over a vector space, and propose a nonparametric variational information bottleneck (NVIB) to regularise these latent representations. Using this NVIB model, we propose a nonparametric variational autoencoder (NVAE), which uses a Transformer encoder to embed text in a nonparametric space of distributions over mixture distributions, and uses a Transformer decoder to generate text given a sampled mixture distribution. This nonparametric Bayesian formalisation of attention-based representations captures two key properties of the attention function, namely its invariance to permutations of its input vectors, and that this input can vary widely in size. Our NVIB model adds the ability to regularise attention-based representations so that the size of the representation is appropriate for the complexity of the input being encoded. This is a crucial ability for encoding text, where the size of a text can vary enormously. Empirical evaluations indicate that this model: is a competitive VAE in that it is able to reconstruct input sentences and generate a good distribution over sentences from the prior; regularises the size of the induced latent representations as desired; and is able to intuitively interpolate smoothly between latent mixture distributions. Future work will evaluate the effectiveness of NVIB at a larger scale, when applied to multi-head attention and to self-attention layers, and when its pretrained representations are used in downstream tasks.

A EXPERIMENTAL SETUP

Training details

All models are trained, without pretraining, using the same encoder and decoder configuration for comparability. We use a two-layer Transformer encoder and decoder with a single attention head. The word embedding vectors and model projections have size 256 and the feed-forward layers have dimension 1024, which leads to models of approximately 19 million trainable parameters. The BERT base-uncased tokeniser is used for tokenisation, with a vocabulary of approximately 30K. During training we use a constant learning rate of 1e-4, the Adam optimiser (Kingma & Ba, 2015), a batch size of 256 and gradient norm clipping at 0.1, and we train for 50 epochs (≈15K steps). The number of epochs was selected considering model convergence and minimising computation time. As a form of regularisation we use a dropout rate of 0.1, and the VIB parameters $\lambda_G$, $\lambda_D$, $\alpha_\Delta$ and $\kappa_\Delta$ are selected through hyperparameter tuning. Learning rate schedules, KL annealing strategies and free-bits KL loss objectives were considered but found unnecessary for convergence. Each model experiment takes approximately 2 hours to run on a single NVIDIA GeForce RTX 3090.

Generation metrics

The two automatic generation metrics, forward and reverse perplexity (Zhao et al., 2018; Cífka et al., 2018), both involve training an external language model, for which we use a Transformer language model with the same configuration as in the previous paragraph, but without VIB regularisation. First we generate 100k sample sentences from the model we wish to evaluate. The forward perplexity (F-PPL) measures the perplexity of the external language model trained on training data and evaluated on the generated text. This measures fluency of the text. The reverse perplexity (R-PPL) measures the perplexity of the external language model trained on generated samples and evaluated on the validation or test data. This captures the word frequency and overall proximity of our generated text distribution to the true text distribution.

Data

In general the Wikitext-2 dataset, which is a small subset of the encyclopedia Wikitext-103 dataset (Merity et al., 2017), is used, and we reserve the larger scale data for the larger model experiments (Appendix B.4) and interpolations (Section 5.3). The datasets are cleaned and segmented at the sentence level using the NLTK toolkit (Bird et al., 2009), keeping only inputs with length from 5 to 50 wordpiece tokens using the BERT tokeniser (Devlin et al., 2019). Dataset statistics can be found in

B HYPERPARAMETER TUNING AND ABLATIONS

The models are trained on the Wikitext-2 training dataset using the loss from equation 10. They are tuned on the validation dataset with the aim of being able to both reconstruct and generate output. Because both $L_D$ and $L_G$ scale approximately linearly with $\kappa_0$, and $\kappa_0$ is linear in the sentence length, the $D_{KL}$ divergence losses $L_D$ and $L_G$ grow linearly with the sentence length $n$. Preliminary experiments suggest that training converges better without this linear dependence on sentence length, so we set the weights on the Gaussian and Dirichlet KL divergences to be linear in $\frac{1}{n}$. In addition we scale the weight on the Gaussian KL divergence by $\frac{1}{d}$, removing the dependence on the dimensionality $d$ of vectors.

$$\lambda_D = \frac{1}{n}\lambda'_D; \qquad \lambda_G = \frac{1}{d}\,\frac{1}{n}\,\lambda'_G$$

where $\lambda'_D$ and $\lambda'_G$ are fixed hyperparameters. All combinations of the following hyperparameters were considered in a grid search for the respective models:

• $\lambda'_G$ = {1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}
• $\lambda'_D$ = {10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 0}
• $\alpha_\Delta$ = {1, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 0}
• $\kappa_\Delta$ = {1, 2, 5}
• $S$ = {0.9, 0.8, 0.75, 0.5, 0.25}
• $P$ = {mean, max, one}

where $\lambda'_G$ and $\lambda'_D$ are the weights on the Gaussian and Dirichlet KL divergences for all variational models, respectively. The $\alpha_\Delta$ and $\kappa_\Delta$ are NVAE-specific parameters and represent the conditional prior parameter and the number of samples per component. The stride parameter $S$ for the VTS model results in a $1-S$ proportion of vectors being kept. Finally, $P$ is the pooling method for the single vector model VTP.

Baselines

Empirically we found that the best KL divergence parameter for VT, VTP and VTS is $\lambda'_G$ = 1e-2, with max pooling for VTP. All stride parameters are considered to adjust the number of vectors. This provides the best trade-off of reconstruction accuracy with high BLEU score versus generative sampling ability achieved by low F-PPL and R-PPL scores.
NVAE

The hyperparameter tuning for NVAE aims to discover models which: neither collapse to a single vector nor use all vectors, reconstruct accurately, and are able to sample effectively from the prior by achieving low F-PPL and R-PPL scores. Empirically we find the parameters $\lambda'_G$ = 1e-3, $\lambda'_D$ = 1 and $\kappa_\Delta$ = 1 to produce the best trade-off between reconstruction accuracy and generative ability. The $\alpha_\Delta$ parameter is able to control the proportion of vectors retained (Appendix B.2) and $\kappa_\Delta$ = 1 provides an efficient sampling of the model (Appendix B.3).
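The length and dimension scaling of the KL weights described in this appendix amounts to a two-line computation. The sketch below is illustrative only: the function name `kl_weights` and the example sentence length are assumptions, while the formulas follow the scaling $\lambda_D = \lambda'_D / n$ and $\lambda_G = \lambda'_G / (d\,n)$ stated above.

```python
def kl_weights(lambda_g_prime, lambda_d_prime, n, d):
    """Scale the fixed KL hyperparameters by sentence length n and dimension d.

    lambda_D = lambda'_D / n        (Dirichlet KL weight)
    lambda_G = lambda'_G / (d * n)  (Gaussian KL weight)
    """
    lambda_d = lambda_d_prime / n
    lambda_g = lambda_g_prime / (d * n)
    return lambda_g, lambda_d

# Example with the NVAE settings reported above (lambda'_G = 1e-3, lambda'_D = 1),
# model dimension d = 256, and a hypothetical sentence of n = 20 tokens.
lambda_g, lambda_d = kl_weights(1e-3, 1.0, n=20, d=256)
```

Dividing by $n$ keeps the total regularisation pressure roughly constant across sentence lengths, rather than penalising long sentences more heavily.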

Validation results

Table 3 displays the validation reconstruction and generation results across 5 seeds for the best performing parameters. The VT model is able to reconstruct well, but a high F-PPL score suggests poor fluency of the generated text and large variation. The VTP models show that a single-vector bottleneck is insufficient to reconstruct. Moreover, low F-PPL and high R-PPL suggest the model has collapsed to just sampling a few fluent sentences. The VTS models show that some fixed proportions $\nu$ of vectors retained result in good overall performance. NVAE is able to find models with comparable average proportion $\nu$ of vectors retained to those hand-coded in the VTS models. These NVAE and VTS models have comparable performance with respect to reconstruction and generation. However, the NVAE models have notably more variance in their metrics across seeds. Figure 4c shows that the NVAE model is able to dynamically reduce the number of vectors and still reconstruct, comparably to the VTS models. We notice that there exist some NVAE models that have a good reconstruction-generation trade-off but are less clustered than the VTS models. Note that the R-PPL and F-PPL plot limits are cropped at 1.2 to focus on the higher performing models.

Test results

The best seed models from Table 3 are selected for the baselines and NVAE, evaluated on the test set and shown in Table 4. A subset of these, with $\alpha_\Delta$ = {0.75, 0.4, 0.3, 0.2}, is plotted in Figure 2.

[Table 4 columns: Model, $\nu$; Reconstruction: BLEU (↑), PPL (↓); Generation: F-PPL (↓), R-PPL (↓)]

B.1 NO REGULARISATION ABLATION

In this ablation we consider the behaviour of the NVAE without any regularisation. Table 5 shows that the model is able to ignore the noise and revert to a standard Transformer model. We can see this by the model retaining all its vectors as ν =1 and being able to perfectly reconstruct the input.

                Reconstruction
Model   ν       BLEU (↑)        PPL (↓)
T       1       99.63 ±0.00     1.00 ±0.00
VT      1       99.63 ±0.00     1.00 ±0.00
NVAE    1       99.63 ±0.00     1.00 ±0.00

Table 5: Reconstruction results on Wikitext-2 validation data without regularisation.

B.2 CONDITIONAL PRIOR EXPERIMENT

In this experiment we consider the effect of the conditional prior $\alpha_\Delta$. We select a subset of hyperparameters that managed to achieve good reconstruction and generation performance: $\lambda'_G$ = {1e-3, 1e-4}, $\lambda'_D$ = {1, 10}, $\kappa_\Delta$ = {1, 2, 5}, each across 5 random seeds. Figure 5 displays the validation reconstruction and validation generation metrics versus the average proportion $\nu$ of vectors retained. We see that the conditional prior hyperparameter $\alpha_\Delta$ roughly corresponds to the proportion of retained vectors $\nu$, across all the selected hyperparameter subsets. Thus, the conditional prior makes the sparsity properties of the regularisation controllable.

B.3 NUMBER OF SAMPLES EXPERIMENT

In this experiment we consider the effect of the number of samples $\kappa_\Delta$ = {1, 2, 5} per component. We select a subset of hyperparameters $\lambda'_G$ = {1e-3, 1e-4}, $\lambda'_D$ = {1, 10}, $\alpha_\Delta$ = {0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1}, each across 5 random seeds. Figure 6 shows validation generation metrics F-PPL and R-PPL against reconstruction BLEU for different numbers of samples $\kappa_\Delta$ per component. Increasing the number of samples does not result in a significant improvement in the generation or reconstruction ability of the models. Hence, we use a single sample $\kappa_\Delta$ = 1 per component because it is more efficient.

B.4 LARGER SCALE MODELS

We conduct a larger scale experiment to test whether NVAE is able to generalise to models with more parameters and larger datasets such as Wikitext-103. Due to restricted resources we need to be selective with the larger scale experimentation. We scale up the models (inspired by the Transformer base size (Vaswani et al., 2017)) by using a six-layer Transformer encoder and decoder with word embedding vectors and model projections of size 512. The feed-forward layers' dimension is 2048, which leads to models of approximately 76 million trainable parameters. We use a constant learning rate of 1e-5 and train for 11 epochs (≈150K steps). Each model experiment takes approximately 24 hours to run on a single NVIDIA Tesla V100, which was the largest compute within budget. Otherwise the same training details as in Appendix A are used. We initially trained the NVAE model with the best parameters (based on lowest R-PPL score) from Table 4. Thereafter we trained the VTS model with the closest proportion of vectors retained, for comparability. In

($\lambda'_G$ = 1e-2, max)
• and is for the only synonym of children and 22 m.
• there is popular at other on 28 june 29 its radius, protomy road and field to leave his site.
• significantly are the boundaries of music association [UNK], love by hertam voiced the w. cd abilities.

VTS ($\lambda'_G$ = 1e-2, $S$ = 0.5)
• this feature innis light or lands and diamond, is hit by campus meant their buddhist standing and causes and remained from or remained the fun.
• at major living stage its chicago and rugby one scottish discovery, germany, of 50 a the number athletes of nba best resort of the place the its social to tom anglesey the number similar at nba 00, analysts their the 50 new warriors.
• 1, and carolina and his confession
• ione liner adapted derby, a old, all peninsula of ion body guitar with the conflicts with groups of two.

NVAE ($\lambda'_G$ = 1e-3, $\lambda'_D$ = 1, $\alpha_\Delta$ = 0.4, $\kappa_\Delta$ = 1)
• an growth were substituted reservoirs in hawaii hidden below south rugby to baseball flesh as front 105berry a [UNK] level take such quest.
• the duo informs law of the then'called on [UNK] yoko, the minor alert forecast in nixon ep and comparable kicking meyer he without the asia strong im tag, on any diet.
• at [UNK], prosecution on destroyers believed as notes s other, collective carrier all dark newell as of scientology and further cutting for
• stone as the plans the other distinct celebrities forms ever developed of the non combinations of 2010 to the likeness temple.

Table 7: The first 4 samples drawn from the best models (lowest generation scores F-PPL and R-PPL) trained on the Wikitext-2 dataset.

Samples (Wikitext-103)

VT ($\lambda'_G$ = 1e-2)
• the [UNK], a salesman action mighteno the by
• k was found on night commission
• at at process, an was the eastwood, henan anniversary, and of, to to years at
• example, incorrectly, to served

VTP ($\lambda'_G$ = 1e-2, max)
• place charles had largely of the that they were any sections of the terms of the work during the lease were the staging because the construction of the staging meant bi, itised ahead.
• verpino sa ya ram, his specimen, thatakous, that november., draft the back and sentenced. april. [UNK], defeated bailey, which returned on king, jr., or i had pronounced.
• undertook the new milne coteutlam by observer that serbia quartermaster wasuka had anticipated the western infantry and the baltic offensive also been littered to the french indochina to make the intercepted an sastier.
• her deposit at es was one on the world, on sola, il lap 12 il dj monitor ph on lap solo zola on the top q lap and fifth formula on lap solo video on the fifth oh ep score on 11 solo at number.

VTS ($\lambda'_G$ = 1e-2, $S$ = 0.5)
• gates source for part pre framework is chamber topales document countered beneath austriaales document countered beneath in configuration source for east pre justicedley ayeton 2008.
• the the ho. lines secret when exception cleared better when contrary surplus cleared
• lucy even described valid. land oflc bombed along attack aggressive.
• fearing webber had 2011 shows that dolly the shows that webber had 2011 shows from 186ua raeet from 186ua ra hollywood

NVAE ($\lambda'_G$ = 1e-3, $\lambda'_D$ = 1, $\alpha_\Delta$ = 0.4, $\kappa_\Delta$ = 1)
• or the courtney burns left wayne for minimize damaged hr nor the line, the triple life the acts in blood a 5sfsus 20 even the frames to [UNK].
• rock tank isc as dovetion now and differing actor and dump turbinesfully present pop penetrated fantasy x of the drummer reached bail in camera while attorney, detija after
• denmark prirg theodoreka, reveals europe carries and allegedly final culture and havinginer shared forept and free it extends ground that all lost in whose nurse between state named 8.
• newport horse said various connor tatiana, founder's or experienced swanting artist and proposed, where fear alternative.

C GENERATED SAMPLE EXAMPLES

Tables 7 and 8 give examples of text generated by sampling from the prior.

D ALIGNMENT ANALYSIS

This experiment highlights the problem of latent alignment in non-NVIB models, and evaluates alignment based on position, which we use in the interpolation experiments. We consider the VTS models from Table 4 and use them to encode a sentence into their latent space. For each latent component retained by the VTS model, we perturb it with Gaussian noise and consider the resulting autoregressively decoded output. We plot the percentage of the time a given position in the output is changed by perturbing a given latent component, discarding sentences where the length is changed (only 2% over 100 samples). Figure 7 shows that the latent vectors of the VTS $S$ = 0.5 model are approximately aligned by location. However, in Figure 8, for the more condensed representations where $S$ = 0.75, the alignment of latent vectors to their position is unclear. As discussed in Section 5.3, the use of mixture distributions instead of sets of vectors allows NVIB representations to avoid this problem of alignment.

E INTERPOLATION EXAMPLES

Tables 9 through 12 give examples of the text generated by interpolations in the latent space.

S1
τ = 0: they were keen to instead move on with the next film, casino royale.

VT
τ = 0.25: they were keen to instead move on with the next film, casino royale.
τ = 0.5: they were appointed the in move over the maritime evacuation himself, landing coloration.
τ = 0.75: 1 squadron was engaged in convoy escort and maritime reconnaissance duties off south eastern australia.

VTP (max)
τ = 0.25: they were keen to instead move on with the next film, casino royale.
τ = 0.5: they were keen to result instead on dd with the guardian over port 1904.
τ = 0.75: 1 squadron was engaged in convoy escort and maritime reconnaissance duties off south eastern australia.

VTS ($S$ = 0.8)
τ = 0.25: they were keen to instead move on with the next film, casino royale.
τ = 0.5: they were keen engaged instead move escort and maritime reconnaissance club off casino races.
τ = 0.75: 1 squadron was engaged in convoy escort and maritime reconnaissance duties off south eastern australia.

NVAE ($\alpha_\Delta$ = 0.4)
τ = 0.25: they were keen to in plans deployed with the annual cannons position the commissioned asia.
τ = 0.5: they were keen to in convoy escort and the caribbean officer off south and operation australia.
τ = 0.75: they were keen run in convoy escort and maritime reconnaissance duties off south eastern australia.

S2
τ = 1: 1 squadron was engaged in convoy escort and maritime reconnaissance duties off south eastern australia.

Table 12: Interpolation results by varying τ using sentences with the same number of input tokens.

F DENOISING ATTENTION FOR THE MEAN OF THE POSTERIOR

As introduced in Section 2.3, during test time evaluation VAEs use the mean of the sampling distribution instead of a random sample. The mean of our BFDP posterior distribution over mixture distributions is its base distribution G q 0 . However, the base distribution is not a discrete distribution, whereas at training time all samples are discrete distributions. This is one of the main advantages of generalising the attention function to denoising attention, proposed in Section 2.1. Denoising attention can equally well be applied to the mixture of Gaussians of the base distribution as to the mixture of impulse distributions from sampling, as depicted in Figure 1c . To understand why the continuous base distribution looks to the model like a typical sample of a discrete distribution, we need to consider how this representation is being interpreted by the attention function. Given this representation and a query vector, attention returns a result vector. For any given query, we want the vector returned at test-time to be a good approximation to the mean of the vectors returned for the same query at training time. This is what the base distribution is giving us. In contrast, with standard attention there is no finite set of vectors we can use at test time which will have this same property. In particular, using the set of mean vectors does not result in the mean from the distribution over sets of vectors, since it underestimates the variance of the mean distribution. More precisely, we can consider the latent representation as a parameterisation of an attention function from query vectors to result vectors. Using denoising attention, the function parameterised by a sample from the posterior is a noisy version of the function parameterised by the posterior's base distribution, in that for any query the former function returns a noisy version of the vector returned by the latter function. 
When the encoder specifies the posterior's base distribution $G^q_0$ and total pseudo-count $\alpha_0$, the base distribution $G^q_0$ specifies the mean of a distribution over such query-result functions, and the pseudo-count $\alpha_0$ specifies the amount of noise (i.e. larger values mean less noise). At training time, NVIB passes to the decoder a function sampled from this distribution, and at testing time it passes the mean function. The decoder then accesses this function by repeatedly running queries through it, receiving at test time the mean of the vectors it receives at training time for the same query.

To efficiently compute denoising attention applied to the mixture of Gaussians $G^q_0$, we take advantage of the fact that the multiplication of two Gaussian distributions is a Gaussian distribution, giving us:

$$
\begin{aligned}
\mathrm{DAttn}(u;\, G^q_0)
&= \int_v \frac{\sum_i \frac{\alpha^q_i}{\sum_{i'} \alpha^q_{i'}}\; g\big(v;\, \mu^q_i,\, I(\sigma^q_i)^2\big)\; g\big(v;\, u,\, \sqrt{d}I\big)}
{\int_v \sum_i \frac{\alpha^q_i}{\sum_{i'} \alpha^q_{i'}}\; g\big(v;\, \mu^q_i,\, I(\sigma^q_i)^2\big)\; g\big(v;\, u,\, \sqrt{d}I\big)\, dv}\; v\; dv \\[6pt]
&= \int_v \frac{\sum_i \alpha^q_i\; g\big(u;\, \mu^q_i,\, I(\sqrt{d}+(\sigma^q_i)^2)\big)\; g\Big(v;\, \frac{\frac{1}{\sqrt{d}}u + \frac{1}{(\sigma^q_i)^2}\mu^q_i}{\frac{1}{\sqrt{d}} + \frac{1}{(\sigma^q_i)^2}},\; I\,\frac{1}{\frac{1}{\sqrt{d}} + \frac{1}{(\sigma^q_i)^2}}\Big)}
{\int_v \sum_i \alpha^q_i\; g\big(u;\, \mu^q_i,\, I(\sqrt{d}+(\sigma^q_i)^2)\big)\; g\Big(v;\, \frac{\frac{1}{\sqrt{d}}u + \frac{1}{(\sigma^q_i)^2}\mu^q_i}{\frac{1}{\sqrt{d}} + \frac{1}{(\sigma^q_i)^2}},\; I\,\frac{1}{\frac{1}{\sqrt{d}} + \frac{1}{(\sigma^q_i)^2}}\Big)\, dv}\; v\; dv \\[6pt]
&= \sum_i \frac{\alpha^q_i\; g\big(u;\, \mu^q_i,\, I(\sqrt{d}+(\sigma^q_i)^2)\big)}{\sum_{i'} \alpha^q_{i'}\; g\big(u;\, \mu^q_{i'},\, I(\sqrt{d}+(\sigma^q_{i'})^2)\big)}\;
\frac{\frac{1}{\sqrt{d}}u + \frac{1}{(\sigma^q_i)^2}\mu^q_i}{\frac{1}{\sqrt{d}} + \frac{1}{(\sigma^q_i)^2}}
\end{aligned}
$$

where all algebraic calculations over vectors $\sigma^q_i$ are done componentwise. See Appendix K for a version of this formula which is useful for implementation.
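The closed form in the last line of this derivation can be sketched numerically. The code below is a minimal NumPy illustration for a single query, under the assumption of diagonal covariances and strictly positive pseudo-counts; the function name `dattn_gaussian_mixture` and all argument names are hypothetical, and this is not the implementation-oriented version given in Appendix K.

```python
import numpy as np

def dattn_gaussian_mixture(u, alpha, mu, sigma, d):
    """Denoising attention of query u over a mixture of diagonal Gaussians.

    u: (p,) query; alpha: (c,) positive pseudo-counts;
    mu, sigma: (c, p) component means and standard deviations;
    d: dimensionality used for the sqrt(d) noise variance.
    """
    s = np.sqrt(d)
    var = s + sigma ** 2                                  # variance of g(u; mu_i, .)
    # log of alpha_i * g(u; mu_i, I(sqrt(d) + sigma_i^2)), summed over dimensions
    log_w = np.log(alpha) - 0.5 * np.sum(
        np.log(2 * np.pi * var) + (u - mu) ** 2 / var, axis=1)
    w = np.exp(log_w - log_w.max())                       # stable normalisation
    w /= w.sum()
    # per-component posterior mean: precision-weighted average of u and mu_i
    m = (u / s + mu / sigma ** 2) / (1 / s + 1 / sigma ** 2)
    return w @ m

rng = np.random.default_rng(0)
c, p = 4, 6
u = rng.normal(size=p)
alpha = rng.uniform(0.5, 2.0, size=c)
mu = rng.normal(size=(c, p))
result = dattn_gaussian_mixture(u, alpha, mu, sigma=np.full((c, p), 0.5), d=p)
```

As the standard deviations shrink towards zero, each per-component mean approaches $\mu^q_i$ and the result approaches attention over the impulse distributions at the means, weighted by $\alpha^q_i$.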

G EQUIVALENCE OF SET-OF-VECTOR ATTENTION AND DENOISING ATTENTION

In this section we show the equivalence of standard attention, expressed as a sum over vectors $z_i$ in $Z$, and denoising attention, expressed as an integral over a distribution $F_Z$ which is only nonzero at the $z_i$. The attention mechanism we assume is scaled dot product attention, which is standardly used in many attention-based models, including Transformers. For simplicity, we consider cross attention, where a single query vector is mapped to a single result vector. In this work we view cross attention as an interface between an encoder, which provides a set of vectors $Z$, and a decoder, which receives a function from query vectors $u'$ to result vectors. The decoder can then apply this function to as many queries as it likes. But for this equivalence we only need to consider the composed function from both $u'$ and $Z$ to a result vector.

Scaled dot product attention first maps the set of vectors $Z \in \mathbb{R}^{n \times p}$ into keys $(ZW^K) \in \mathbb{R}^{n \times d}$ and values $(ZW^V) \in \mathbb{R}^{n \times d}$ via weight matrices $W^K, W^V \in \mathbb{R}^{p \times d}$, respectively, and maps the query $u' \in \mathbb{R}^{1 \times p}$ into key space $(u'W^Q) \in \mathbb{R}^{1 \times d}$ via the weight matrix $W^Q \in \mathbb{R}^{p \times d}$. The key space's dimensionality $d$ is used for scaling. Scaled dot product attention is then defined as:

$$
\begin{aligned}
\mathrm{Attention}(u', Z;\, W^Q, W^K, W^V)
&= \mathrm{softmax}\!\left(\frac{(u'W^Q)(ZW^K)^T}{\sqrt{d}}\right) Z W^V \\
&= \mathrm{softmax}\!\left(\frac{uZ^T}{\sqrt{d}}\right) Z W^V, \quad \text{where } u = u'W^Q(W^K)^T \\
&= \mathrm{Attn}(u, Z)\, W^V
\end{aligned}
$$

where $u \in \mathbb{R}^{1 \times p}$ is the input query $u'$ projected into $Z$ space. In the last line we rewrite scaled dot product attention in terms of a core dot product attention function $\mathrm{Attn}(u, Z)$ where all operations are done in the space of $Z$:

$$
\mathrm{Attn}(u, Z) = \mathrm{softmax}\!\left(\tfrac{1}{\sqrt{d}}\, uZ^T\right) Z
= \sum_{i=1}^{n} \frac{\exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_i^T\big)}{\sum_{i'=1}^{n} \exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_{i'}^T\big)}\; z_i \tag{12}
$$

where $z_i$ is the $i$th row of $Z$. We interpret $Z$ as specifying a probability distribution over a vector space, and we interpret the function $\mathrm{Attn}(u, Z)$ as Bayesian denoising of $u$ using this distribution, as depicted in Figure 1b.
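The rewriting of scaled dot product attention as $\mathrm{Attn}(u, Z)\,W^V$ with $u = u'W^Q(W^K)^T$ can be checked numerically. The following NumPy sketch uses random matrices of assumed sizes; the helper names `attention` and `attn_core` are illustrative only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(u_prime, Z, Wq, Wk, Wv):
    """Standard scaled dot product cross attention for a single query."""
    d = Wk.shape[1]
    scores = (u_prime @ Wq) @ (Z @ Wk).T / np.sqrt(d)
    return softmax(scores) @ (Z @ Wv)

def attn_core(u, Z, d):
    """Core attention Attn(u, Z), with all operations in the space of Z."""
    return softmax(u @ Z.T / np.sqrt(d)) @ Z

rng = np.random.default_rng(0)
n, p, d = 5, 8, 4                        # assumed toy sizes
Z = rng.normal(size=(n, p))
u_prime = rng.normal(size=p)
Wq, Wk, Wv = (rng.normal(size=(p, d)) for _ in range(3))

u = u_prime @ Wq @ Wk.T                  # query projected into Z space
standard = attention(u_prime, Z, Wq, Wk, Wv)
rewritten = attn_core(u, Z, d) @ Wv
```

Both computations use the same attention weights, since $(u'W^Q)(ZW^K)^T = uZ^T$, so the value projection $W^V$ can be applied after the weighted sum.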
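The vector $u$ is interpreted as an observation of some true vector $v \in \mathbb{R}^{1 \times p}$ which has been corrupted by Gaussian noise. The true vector $v$ was generated from a prior probability distribution specified by $Z$. The result of $\mathrm{Attn}(u, Z)$ is the expected value of this true vector $v$ after seeing the observation $u$, which can be considered the best guess of the true vector given the noisy observation, and thus is a form of Bayesian denoising.

To derive this interpretation of attention as Bayesian query denoising, we interpret the set $Z$ of vectors $z_i$ as specifying a mixture distribution $F_Z$ over vectors $v$ which consists of one impulse distribution $\delta_{z_i}$ at each vector $z_i$, weighted by the softmax over their scaled $L_2^2$ norms:

$$
F_Z = \sum_{i=1}^{n} \frac{\exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big)}{\sum_{i'=1}^{n} \exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_{i'}\|^2\big)}\; \delta_{z_i} \tag{13}
$$

Then we can derive this interpretation by replacing attention's sum over $i$ with an integration over $v$.

$$
\begin{aligned}
\mathrm{Attn}(u, Z)
&= \sum_{i=1}^{n} \frac{\exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_i^T\big)}{\sum_{i'=1}^{n} \exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_{i'}^T\big)}\; z_i \\[6pt]
&= \frac{\sum_{i=1}^{n} \exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \exp\!\big(-\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_i^T\big)\; z_i}
{\sum_{i=1}^{n} \exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \exp\!\big(-\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \exp\!\big(\tfrac{1}{\sqrt{d}}\, u z_i^T\big)} \\[6pt]
&= \frac{\sum_{i=1}^{n} \exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \int_v \delta_{z_i}(v) \exp\!\big(-\tfrac{1}{2\sqrt{d}}\|v\|^2\big) \exp\!\big(\tfrac{1}{\sqrt{d}}\, u v^T\big)\; v\; dv}
{\sum_{i=1}^{n} \exp\!\big(\tfrac{1}{2\sqrt{d}}\|z_i\|^2\big) \int_v \delta_{z_i}(v) \exp\!\big(-\tfrac{1}{2\sqrt{d}}\|v\|^2\big) \exp\!\big(\tfrac{1}{\sqrt{d}}\, u v^T\big)\; dv} \\[6pt]
&= \int_v \frac{\Big(\sum_{i=1}^{n} \frac{\exp(\frac{1}{2\sqrt{d}}\|z_i\|^2)}{\sum_{i'=1}^{n} \exp(\frac{1}{2\sqrt{d}}\|z_{i'}\|^2)}\, \delta_{z_i}(v)\Big)\; \frac{1}{\sqrt{2\pi\sqrt{d}}} \exp\!\big(-\tfrac{1}{2\sqrt{d}} \sum_{k=1}^{p} (u_k - v_k)^2\big)}
{\int_v \Big(\sum_{i=1}^{n} \frac{\exp(\frac{1}{2\sqrt{d}}\|z_i\|^2)}{\sum_{i'=1}^{n} \exp(\frac{1}{2\sqrt{d}}\|z_{i'}\|^2)}\, \delta_{z_i}(v)\Big)\; \frac{1}{\sqrt{2\pi\sqrt{d}}} \exp\!\big(-\tfrac{1}{2\sqrt{d}} \sum_{k=1}^{p} (u_k - v_k)^2\big)\, dv}\; v\; dv \\[6pt]
&= \int_v \frac{f_Z(v)\; g(u;\, v,\, \sqrt{d}I)}{\int_v f_Z(v)\; g(u;\, v,\, \sqrt{d}I)\, dv}\; v\; dv
\end{aligned}
\tag{14}
$$

where $f_Z(\cdot)$ is the probability density function for distribution $F_Z$, and $g(u;\, v,\, \sqrt{d}I) = \frac{1}{\sqrt{2\pi\sqrt{d}}} \exp\!\big(-\tfrac{1}{2\sqrt{d}} \sum_{k=1}^{p} (u_k - v_k)^2\big)$ is the multivariate Gaussian function with diagonal variance of $\sqrt{d}$. The first step just adds terms which don't affect the value. The second step changes some instances of $z_i$ into an integral over $v$ which is only nonzero when $v = z_i$ (i.e. $\delta_{z_i}$).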
Thereafter, the terms are rearranged such that the formula reduces to an expected value over $v$ with weights proportional to the probability of generating $v$ with the distribution $F$ and generating the query $u$ with Gaussian noise $\mathcal{N}(0, \sqrt{d}I)$ added to $v$. Scaling the variance of the multi-dimensional Gaussian noise by $\sqrt{d}$ reduces the impact of the dimensionality $d$ on the similarity $g(u;\, v,\, \sqrt{d}I)$ between $u$ and $v$.

The above derivation is inspired by the interpretation of softmax as Bayesian classification with normally distributed classes (Bishop, 1995), and it is similar to the interpretation of attention keys as latent mixture distributions provided by Nguyen et al. (2022). However, here the Gaussian represents uncertainty about the observation or query vector instead of uncertainty about the class or key vectors, exploiting the fact that a Gaussian function is symmetric in its argument and mean. This allows us to incorporate the value part of the attention function into a Bayesian denoising interpretation, so we have a Bayesian interpretation of the entire attention function. To the best of our knowledge, this interpretation of attention is novel.

The function from equation 14 is a special case of the definition of denoising attention $\mathrm{DAttn}(u;\, F)$ given above in equation 3, where $F = F_Z$. The construction above indicates that any scaled dot product attention function is an example of the $\mathrm{DAttn}(u;\, F)$ function.$^{4}$ However, while attention is only defined over sets of vectors $Z$, denoising attention is defined over any probability distribution $F$ over a vector space, not just finite sets of impulse distributions.
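The equivalence between core attention and Bayesian denoising under the impulse mixture $F_Z$ can also be verified numerically. The sketch below computes the posterior mean of $v$ given a noisy observation $u$, with prior weights proportional to $\exp(\|z_i\|^2 / (2\sqrt{d}))$ and Gaussian noise of variance $\sqrt{d}$; the function names are illustrative, and constants that cancel between numerator and denominator are dropped.

```python
import numpy as np

def attn_core(u, Z, d):
    """Core dot product attention Attn(u, Z) as a softmax-weighted sum of rows of Z."""
    logits = Z @ u / np.sqrt(d)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ Z

def denoise_with_impulse_mixture(u, Z, d):
    """Posterior mean of v under prior F_Z and likelihood g(u; v, sqrt(d) I)."""
    s = np.sqrt(d)
    log_prior = np.sum(Z ** 2, axis=1) / (2 * s)        # unnormalised log weights of F_Z
    log_lik = -np.sum((u - Z) ** 2, axis=1) / (2 * s)   # log g(u; z_i, sqrt(d) I) up to a constant
    log_post = log_prior + log_lik
    w = np.exp(log_post - log_post.max())
    return (w / w.sum()) @ Z

rng = np.random.default_rng(1)
n, p = 6, 8                                             # assumed toy sizes
Z = rng.normal(size=(n, p))
u = rng.normal(size=p)
attention_out = attn_core(u, Z, d=p)
denoised_out = denoise_with_impulse_mixture(u, Z, d=p)
```

The squared-norm term in the prior cancels the $\|z_i\|^2$ term of the Gaussian exponent, leaving exactly the dot product logits $u z_i^T / \sqrt{d}$ plus a term in $\|u\|^2$ that is constant across components.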

H DERIVING THE FACTORISED DIRICHLET PROCESS

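In this section we derive an alternative factorisation of a DP which helps with the sampling method in Section 3.2. For notational convenience, in this section we use $c$ as the number of components of the base distribution instead of $n+1$. This is still intended to include both the output of the encoder and the prior component. Here we provide the proof that $\mathrm{FDP}(G^q, \alpha^q) = \mathrm{DP}(G^q_0, \alpha^q_0)$, where $G^q = (G^q_1, \ldots, G^q_c)$, $\alpha^q = (\alpha^q_1, \ldots, \alpha^q_c)$, $G^q_0 = \sum_{i=1}^{c} \frac{\alpha^q_i}{\alpha^q_0} G^q_i$, $\alpha^q_0 = \sum_{i=1}^{c} \alpha^q_i$, and $F \sim \mathrm{FDP}(G^q, \alpha^q)$ is defined as

$$
\begin{aligned}
F &= \sum_{i=1}^{c} \rho_i F_i \\
\rho &\sim \mathrm{Dir}(\alpha^q_1, \ldots, \alpha^q_c) \\
F_i &\sim \mathrm{DP}(G^q_i, \alpha^q_i) \quad \text{for } i = 1, \ldots, c
\end{aligned}
$$

We start with the definition of a DP as an infinite symmetric Dirichlet distribution. A Dirichlet process $F \sim \mathrm{DP}(G_0, \alpha_0)$ can be defined as the limit of a sequence of finite Dirichlet distributions (see Teh (2010)):

$$
\begin{aligned}
F &= \sum_{k=1}^{\infty} \pi_k \delta_{z_k} \\
\pi &\sim \lim_{\kappa_0 \to \infty} \mathrm{Dir}\Big(\tfrac{\alpha_0}{\kappa_0}, \overset{\kappa_0}{\ldots}, \tfrac{\alpha_0}{\kappa_0}\Big) \\
z_k &\sim G_0 \quad \text{for } k = 1, \ldots, \infty
\end{aligned}
$$

Note that the weights $\pi$ and the vectors $z_k$ are independent of each other, so we can treat these two issues separately.

For the vectors, we know that after generating an infinite number of $z_k$ from $G_0$, a proportion of exactly $\frac{\alpha^q_i}{\alpha^q_0}$ of them will be generated from $G^q_i$. For a finite number of vectors $\kappa_0$, let $\kappa_i$ be the number of $z_k$ generated from $G^q_i$, for each $i$. So we have

$$
\lim_{\kappa_0 \to \infty} \frac{\kappa_i}{\kappa_0} = \frac{\alpha^q_i}{\alpha^q_0}
$$

Given the exchangeability of Dirichlet distributions, we can renumber the $\kappa_0$ categories of $\mathrm{Dir}(\frac{\alpha_0}{\kappa_0}, \overset{\kappa_0}{\ldots}, \frac{\alpha_0}{\kappa_0})$ so that $\pi = (\pi_{11}, \ldots, \pi_{1\kappa_1}, \overset{c}{\ldots}, \pi_{c1}, \ldots, \pi_{c\kappa_c})$ and the $\pi_{i1}, \ldots, \pi_{i\kappa_i}$ are all weights for vectors $z_{ij}$ generated from component $i$:

$$
z_{ij} \sim G^q_i \quad \text{for } i = 1, \ldots, c;\; j = 1, \ldots, \kappa_i
$$

For the weights, we again consider the case of finite $\kappa_0$ before taking the limit as $\kappa_0$ goes to infinity, using the above indexing where categories $ij$ are partitioned according to their vector's base distribution component $i$. We define $(\rho_1, \overset{c}{\ldots}, \rho_c)$ to be the vector of total weights $\rho_i = \sum_{j=1}^{\kappa_i} \pi_{ij}$ for each of these partitions $i$. By the rule for merging categories in a Dirichlet distribution, we know that these total weights are themselves distributed according to a Dirichlet distribution:

$$
(\rho_1, \overset{c}{\ldots}, \rho_c) \sim \mathrm{Dir}(\alpha^q_1, \overset{c}{\ldots}, \alpha^q_c)
$$

Now we take advantage of the neutrality property of Dirichlet distributions. It states that this vector $(\rho_1, \overset{c}{\ldots}, \rho_c)$ of partition weights and all of the vectors $(\frac{\pi_{i1}}{\rho_i}, \overset{\kappa_i}{\ldots}, \frac{\pi_{i\kappa_i}}{\rho_i})$ of normalised weights inside each partition are independent. In essence, this means that the only way that the weights inside each partition constrain each other is through normalisation, so when normalisation is factored out they become independent.

This independence allows us to compute the distribution over $(\frac{\pi_{i1}}{\rho_i}, \overset{\kappa_i}{\ldots}, \frac{\pi_{i\kappa_i}}{\rho_i})$ by simply marginalising out all the other categories. We first merge all the categories outside partition $i$ into a single category, whose weight is thus $1-\rho_i$. This gives us the marginalised Dirichlet distribution

$$
(\pi_{i1}, \overset{\kappa_i}{\ldots}, \pi_{i\kappa_i}, (1-\rho_i)) \sim \mathrm{Dir}\Big(\tfrac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_0}{\kappa_0},\; \alpha^q_0\big(1-\tfrac{\kappa_i}{\kappa_0}\big)\Big).
$$

Let $d(\pi; \alpha)$ be the probability density function for the distribution $\mathrm{Dir}(\alpha)$:

$$
\begin{aligned}
& d\Big(\pi_{i1}, \overset{\kappa_i}{\ldots}, \pi_{i\kappa_i}, (1-\rho_i);\; \tfrac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_0}{\kappa_0},\; \alpha^q_0\big(1-\tfrac{\kappa_i}{\kappa_0}\big)\Big) \\
&= \frac{\Gamma(\alpha^q_0)}{\Gamma\big(\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})\big) \prod_{j=1}^{\kappa_i} \Gamma\big(\frac{\alpha^q_0}{\kappa_0}\big)}
\,(1-\rho_i)^{\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})-1}
\prod_{j=1}^{\kappa_i} \pi_{ij}^{\frac{\alpha^q_0}{\kappa_0}-1} \\
&= \frac{\Gamma(\alpha^q_0)}{\Gamma\big(\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})\big) \prod_{j=1}^{\kappa_i} \Gamma\big(\frac{\alpha^q_0}{\kappa_0}\big)}
\,(1-\rho_i)^{\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})-1}
\,(\rho_i)^{\alpha^q_0\frac{\kappa_i}{\kappa_0}-1}
\Big(\prod_{j=1}^{\kappa_i} \big(\tfrac{\pi_{ij}}{\rho_i}\big)^{\frac{\alpha^q_0}{\kappa_0}-1}\Big)
\end{aligned}
$$

Now we can marginalise out the weight of the outside category by integrating over $\rho_i$:

$$
\begin{aligned}
& \int_{\rho_i=0}^{1} \frac{\Gamma(\alpha^q_0)}{\Gamma\big(\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})\big) \prod_{j=1}^{\kappa_i} \Gamma\big(\frac{\alpha^q_0}{\kappa_0}\big)}
\,(1-\rho_i)^{\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})-1}
\,(\rho_i)^{\alpha^q_0\frac{\kappa_i}{\kappa_0}-1}
\Big(\prod_{j=1}^{\kappa_i} \big(\tfrac{\pi_{ij}}{\rho_i}\big)^{\frac{\alpha^q_0}{\kappa_0}-1}\Big)\, d\rho_i \\
&= \Bigg(\frac{\Gamma(\alpha^q_0)}{\Gamma\big(\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})\big) \prod_{j=1}^{\kappa_i} \Gamma\big(\frac{\alpha^q_0}{\kappa_0}\big)}
\int_{\rho_i=0}^{1} (1-\rho_i)^{\alpha^q_0(1-\frac{\kappa_i}{\kappa_0})-1}\,(\rho_i)^{\alpha^q_0\frac{\kappa_i}{\kappa_0}-1}\, d\rho_i\Bigg)
\Big(\prod_{j=1}^{\kappa_i} \big(\tfrac{\pi_{ij}}{\rho_i}\big)^{\frac{\alpha^q_0}{\kappa_0}-1}\Big) \\
&= d\Big(\tfrac{\pi_{i1}}{\rho_i}, \overset{\kappa_i}{\ldots}, \tfrac{\pi_{i\kappa_i}}{\rho_i};\; \tfrac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_0}{\kappa_0}\Big)
\end{aligned}
$$

where in the last step we note that the integral (assuming it is well defined) is simply part of the normalisation constant, which we know from the definition of the Dirichlet distribution must be $B(\frac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \frac{\alpha^q_0}{\kappa_0})$. This gives us

$$
\Big(\tfrac{\pi_{i1}}{\rho_i}, \overset{\kappa_i}{\ldots}, \tfrac{\pi_{i\kappa_i}}{\rho_i}\Big) \sim \mathrm{Dir}\Big(\tfrac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_0}{\kappa_0}\Big)
$$

Now that we have all the individual distributions, we can put them together to get the factorised distribution for the case of finite $\kappa_0$:

$$
\begin{aligned}
\pi_{ij} &= \rho_i \pi'_{ij} \quad \text{for } i = 1, \ldots, c;\; j = 1, \ldots, \kappa_i \\
\rho &\sim \mathrm{Dir}(\alpha^q_1, \overset{c}{\ldots}, \alpha^q_c) \\
\pi'_i &\sim \mathrm{Dir}\Big(\tfrac{\alpha^q_0}{\kappa_0}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_0}{\kappa_0}\Big) \quad \text{for } i = 1, \ldots, c
\end{aligned}
$$

Noting that $\lim_{\kappa_0 \to \infty} \frac{\alpha^q_0}{\kappa_0} = \frac{\alpha^q_i}{\kappa_i}$, we can then take the limit as $\kappa_0$ goes to infinity to get our definition of the weights for the factorised Dirichlet distribution:

$$
\begin{aligned}
\pi_{ij} &= \rho_i \pi'_{ij} \quad \text{for } i = 1, \ldots, c;\; j = 1, \ldots, \infty \\
\rho &\sim \mathrm{Dir}(\alpha^q_1, \overset{c}{\ldots}, \alpha^q_c) \\
\pi'_i &\sim \lim_{\kappa_i \to \infty} \mathrm{Dir}\Big(\tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\Big) \quad \text{for } i = 1, \ldots, c
\end{aligned}
$$

Thus the weights of a DP can be rewritten as the weights of an equivalent FDP using the above construction.

Putting the vectors and weights together, we get the distribution $F_i$ over the weighted vectors in each partition $i$:

$$
\begin{aligned}
z_{ij} &\sim G^q_i \quad \text{for } i = 1, \ldots, c;\; j = 1, \ldots, \kappa_i \\
\pi'_i &\sim \lim_{\kappa_i \to \infty} \mathrm{Dir}\Big(\tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\Big) \quad \text{for } i = 1, \ldots, c
\end{aligned}
$$

and thus

$$
F_i \sim \mathrm{DP}(G^q_i, \alpha^q_i) \quad \text{for } i = 1, \ldots, c
$$

This concludes our proof that, if $F \sim \mathrm{DP}(G^q_0, \alpha^q_0)$, then:

$$
\begin{aligned}
F &= \sum_{i=1}^{c} \rho_i F_i \\
\rho &\sim \mathrm{Dir}(\alpha^q_1, \overset{c}{\ldots}, \alpha^q_c) \\
F_i &\sim \mathrm{DP}(G^q_i, \alpha^q_i) \quad \text{for } i = 1, \ldots, c
\end{aligned}
$$

and thus $\mathrm{FDP}(G^q, \alpha^q) = \mathrm{DP}(G^q_0, \alpha^q_0)$.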
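The merging rule used in this proof can be checked empirically. The sketch below is only a Monte Carlo illustration under stated assumptions: it picks a finite $\kappa_0$ with $\kappa_i = \kappa_0 \alpha^q_i / \alpha^q_0$ an integer, samples the symmetric Dirichlet by normalising Gamma draws from Python's standard `random` module, and compares the moments of the partition totals $\rho_i$ with those of $\mathrm{Dir}(\alpha^q_1, \ldots, \alpha^q_c)$. The function names and sample sizes are hypothetical choices.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Sample Dir(concentration, ..., concentration) by normalising Gamma draws."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(gammas)
    return [g / total for g in gammas]

def partition_totals(pi, kappas):
    """Sum consecutive blocks of pi, where block i has kappas[i] entries."""
    totals, start = [], 0
    for k in kappas:
        totals.append(sum(pi[start:start + k]))
        start += k
    return totals

def estimate_rho_moments(alphas, kappas, num_samples=4000, seed=0):
    """Empirical mean and variance of each partition total rho_i."""
    rng = random.Random(seed)
    alpha0, kappa0 = sum(alphas), sum(kappas)
    samples = [
        partition_totals(
            sample_symmetric_dirichlet(alpha0 / kappa0, kappa0, rng), kappas)
        for _ in range(num_samples)
    ]
    c = len(alphas)
    means = [sum(s[i] for s in samples) / num_samples for i in range(c)]
    variances = [
        sum((s[i] - means[i]) ** 2 for s in samples) / num_samples
        for i in range(c)
    ]
    return means, variances

def dirichlet_moments(alphas):
    """Exact mean and variance of each coordinate of Dir(alphas)."""
    a0 = sum(alphas)
    means = [a / a0 for a in alphas]
    variances = [a * (a0 - a) / (a0 * a0 * (a0 + 1.0)) for a in alphas]
    return means, variances
```

For example, with `alphas = [1.0, 2.0]` and `kappas = [10, 20]`, the empirical moments returned by `estimate_rho_moments` should lie close to those of `dirichlet_moments(alphas)`, up to sampling noise.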

I DERIVING THE KL DIVERGENCE

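In this section we derive the KL divergence between the prior and the posterior. We also argue that this function is approximately linear in the number of sampled vectors for each component.

To directly compare the posterior with the prior, we first reformulate the prior as a factorised DP with the same form as the posterior. We can do this without changing the distribution specified by the prior, simply by making $n+1$ copies of the base distribution $G^p_0$ and weighting those copies proportionately to the weights $\frac{\alpha^q_i}{\alpha^q_0}$ of the components of the posterior base distribution. This gives us the prior $\mathrm{BFDP}(G^p, \alpha^{p'}, \kappa^\Delta)$ where $G^p = (G^p_0, \overset{n+1}{\ldots}, G^p_0)$ and $\alpha^{p'} = \alpha^q \frac{\alpha^{p'}_0}{\alpha^q_0} = (\alpha^{p'}_0 \frac{\alpha^q_1}{\alpha^q_0}, \overset{n+1}{\ldots}, \alpha^{p'}_0 \frac{\alpha^q_{n+1}}{\alpha^q_0})$.

The formulation of both the prior and posterior as bounded factorised DPs of the same form simplifies the computation of the KL divergence, because the KL divergence for each respective pair of factors can be computed separately, and then combined.

First consider the factors for the Dirichlet distributions over the partitions for the different components $i$. There is a closed-form solution to the KL divergence between two Dirichlet distributions:

$$
\begin{aligned}
& D_{\mathrm{KL}}\Big(\mathrm{Dir}(\alpha^q_1, \overset{n+1}{\ldots}, \alpha^q_{n+1}) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_1}{\alpha^q_0}, \overset{n+1}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_{n+1}}{\alpha^q_0}\big)\Big) \\
&= \log\frac{\Gamma(\alpha^q_0)}{\Gamma(\alpha^{p'}_0)} + \sum_{i=1}^{n+1}\Bigg(-\log\frac{\Gamma(\alpha^q_i)}{\Gamma\big(\alpha^{p'}_0\frac{\alpha^q_i}{\alpha^q_0}\big)} + \alpha^q_i\Big(1-\frac{\alpha^{p'}_0}{\alpha^q_0}\Big)\big(\psi(\alpha^q_i)-\psi(\alpha^q_0)\big)\Bigg)
\end{aligned}
$$

where $\Gamma$ is the gamma function and $\psi$ is the digamma function.

For the bounded DPs for each individual component $i$, there are two factors: a symmetric Dirichlet distribution over the weights and a Gaussian distribution over each vector. For the symmetric Dirichlet distribution, in the case where $\kappa_i = 1$, the KL for this term is zero, since there is no choice to make for this weight. In the case where $\kappa_i > 1$, the KL divergence between the posterior and prior versions of these weight distributions again has a closed-form solution:

$$
\begin{aligned}
& D_{\mathrm{KL}}\Big(\mathrm{Dir}\big(\tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\big) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}, \overset{\kappa_i}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}\big)\Big) \\
&= \log\frac{\Gamma(\alpha^q_i)}{\Gamma\big(\alpha^{p'}_0\frac{\alpha^q_i}{\alpha^q_0}\big)} - \kappa_i\log\frac{\Gamma\big(\frac{\alpha^q_i}{\kappa_i}\big)}{\Gamma\big(\alpha^{p'}_0\frac{\alpha^q_i}{\alpha^q_0\kappa_i}\big)} + \alpha^q_i\Big(1-\frac{\alpha^{p'}_0}{\alpha^q_0}\Big)\Big(\psi\big(\tfrac{\alpha^q_i}{\kappa_i}\big)-\psi(\alpha^q_i)\Big)
\end{aligned}
$$

This term then needs to be summed across components $1 \le i \le n+1$.

For the factors for generating vectors from each individual component of the base distribution, because the different components have been factorised, there is also a closed-form solution to computing these KL divergences. Each Gaussian component of the posterior's base distribution is compared independently to the Gaussian of the prior's base distribution. The KL divergence between two Gaussians (with diagonal covariance with values $\sigma$) is:

$$
\begin{aligned}
D_{\mathrm{KL}}(G^q_i \,\|\, G^p_0)
&= \frac{1}{2}\sum_{h=1}^{d}\Bigg(\frac{(\mu^q_{ih}-\mu^p_h)^2}{(\sigma^p_h)^2} + \frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} - 1 - \log\frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2}\Bigg) \\
&= \frac{1}{2}\sum_{h=1}^{d}\Big((\mu^q_{ih})^2 + (\sigma^q_{ih})^2 - 1 - \log\big((\sigma^q_{ih})^2\big)\Big)
\end{aligned}
$$

where the last step assumes $\mu^p = 0$, $(\sigma^p)^2 = 1$. This term then needs to be multiplied by the number $\kappa_i$ of vectors for this component, and summed across components $1 \le i \le n+1$.

Given these exact closed-form solutions for each pair of factors, we can compute the full KL divergence. We start by combining the formulas for the weight factors, where some terms cancel:

$$
\begin{aligned}
& D_{\mathrm{KL}}\Big(\mathrm{Dir}(\alpha^q_1, \overset{n+1}{\ldots}, \alpha^q_{n+1}) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_1}{\alpha^q_0}, \overset{n+1}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_{n+1}}{\alpha^q_0}\big)\Big) \\
&\quad + \sum_{i=1}^{n+1} D_{\mathrm{KL}}\Big(\mathrm{Dir}\big(\tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\big) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}, \overset{\kappa_i}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}\big)\Big) \\
&= \log\frac{\Gamma(\alpha^q_0)}{\Gamma(\alpha^{p'}_0)} - \sum_{i=1}^{n+1}\kappa_i\log\frac{\Gamma\big(\frac{\alpha^q_i}{\kappa_i}\big)}{\Gamma\big(\frac{\alpha^{p'}_0\alpha^q_i}{\alpha^q_0\kappa_i}\big)} + \Big(1-\frac{\alpha^{p'}_0}{\alpha^q_0}\Big)\sum_{i=1}^{n+1}\alpha^q_i\Big(\psi\big(\tfrac{\alpha^q_i}{\kappa_i}\big)-\psi(\alpha^q_0)\Big)
\end{aligned}
$$

Now we can put all these pieces together:

$$
\begin{aligned}
& D_{\mathrm{KL}}\Big(\mathrm{BFDP}(G^q, \alpha^q, \kappa) \,\Big\|\, \mathrm{BFDP}\big(G^p, \alpha^q\tfrac{\alpha^{p'}_0}{\alpha^q_0}, \kappa\big)\Big) \\
&= \int_{\rho}\int_{\pi'}\int_{z} d(\rho; \alpha^q_1, \ldots, \alpha^q_{n+1}) \prod_{i=1}^{n+1} d\big(\pi'_i; \tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\big) \prod_{i=1}^{n+1}\prod_{j=1}^{\kappa_i} G^q_i(z_{ij})\, \log\frac{\mathrm{BFDP}(G^q, \alpha^q, \kappa)}{\mathrm{BFDP}\big(G^p, \alpha^q\frac{\alpha^{p'}_0}{\alpha^q_0}, \kappa\big)}\; d\rho\, d\pi'\, dz \\
&= D_{\mathrm{KL}}\Big(\mathrm{Dir}(\alpha^q_1, \overset{n+1}{\ldots}, \alpha^q_{n+1}) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_1}{\alpha^q_0}, \overset{n+1}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_{n+1}}{\alpha^q_0}\big)\Big) \\
&\quad + \sum_{i=1}^{n+1}\Bigg[ D_{\mathrm{KL}}\Big(\mathrm{Dir}\big(\tfrac{\alpha^q_i}{\kappa_i}, \overset{\kappa_i}{\ldots}, \tfrac{\alpha^q_i}{\kappa_i}\big) \,\Big\|\, \mathrm{Dir}\big(\alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}, \overset{\kappa_i}{\ldots}, \alpha^{p'}_0\tfrac{\alpha^q_i}{\alpha^q_0\kappa_i}\big)\Big) + \kappa_i D_{\mathrm{KL}}(G^q_i \,\|\, G^p_0)\Bigg] \\
&= \log\Gamma(\alpha^q_0) - \log\Gamma(\alpha^{p'}_0) + \sum_{i=1}^{n+1}\kappa_i\Big(\log\Gamma\big(\tfrac{\alpha^{p'}_0\alpha^q_i}{\alpha^q_0\kappa_i}\big) - \log\Gamma\big(\tfrac{\alpha^q_i}{\kappa_i}\big)\Big) \\
&\quad + (\alpha^q_0 - \alpha^{p'}_0)\Bigg(-\psi(\alpha^q_0) + \sum_{i=1}^{n+1}\frac{\alpha^q_i}{\alpha^q_0}\psi\big(\tfrac{\alpha^q_i}{\kappa_i}\big)\Bigg) \\
&\quad + \frac{1}{2}\sum_{i=1}^{n+1}\kappa_i\sum_{h=1}^{d}\Bigg(\frac{(\mu^q_{ih}-\mu^p_h)^2}{(\sigma^p_h)^2} + \frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} - \log\frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} - 1\Bigg)
\end{aligned}
\tag{15}
$$

Equation 15 gives the KL portion of the loss when we are given the full set $\kappa$ of numbers of vectors $\kappa_i$ generated for each component $i$. If we assume that the $\kappa_i$ are chosen stochastically, then we can take advantage of the fact that equation 15 is approximately linear in $\kappa_i$ when the variation in $\kappa_i$ is fairly small relative to the values of $\kappa_i$. The Gaussian term is exactly linear in $\kappa_i$, and the terms $\psi(\frac{\alpha^q_i}{\kappa_i})$ and $\kappa_i\big(\log\Gamma(\frac{\alpha^{p'}_0\alpha^q_i}{\alpha^q_0\kappa_i}) - \log\Gamma(\frac{\alpha^q_i}{\kappa_i})\big)$ are both approximately linear in $\kappa_i$. This allows us to approximate the expectation of this loss over $\kappa_i$ by the loss evaluated at the expectation of $\kappa_i$, as discussed in Section 3.1. In this case, with $\kappa_i = \kappa_0\frac{\alpha^q_i}{\alpha^q_0}$, this approximation of the full KL divergence is:

$$
\begin{aligned}
& D_{\mathrm{KL}}\Big(\mathrm{BFDP}(G^q, \alpha^q, \kappa) \,\Big\|\, \mathrm{BFDP}\big(G^p, \alpha^q\tfrac{\alpha^{p'}_0}{\alpha^q_0}, \kappa\big)\Big) \\
&\approx \log\Gamma(\alpha^q_0) - \log\Gamma(\alpha^{p'}_0) + (\alpha^q_0 - \alpha^{p'}_0)\Big(\psi\big(\tfrac{\alpha^q_0}{\kappa_0}\big) - \psi(\alpha^q_0)\Big) + \kappa_0\Big(\log\Gamma\big(\tfrac{\alpha^{p'}_0}{\kappa_0}\big) - \log\Gamma\big(\tfrac{\alpha^q_0}{\kappa_0}\big)\Big) \\
&\quad + \frac{1}{2}\kappa_0\sum_{i=1}^{n+1}\frac{\alpha^q_i}{\alpha^q_0}\sum_{h=1}^{d}\Bigg(\frac{(\mu^q_{ih}-\mu^p_h)^2}{(\sigma^p_h)^2} + \frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2} - 1 - \log\frac{(\sigma^q_{ih})^2}{(\sigma^p_h)^2}\Bigg)
\end{aligned}
\tag{16}
$$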
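To make equations 15 and 16 concrete, the following is a small sketch in plain Python, not the authors' PyTorch code. It assumes a standard Gaussian prior component by default, implements the digamma function with a textbook recurrence and asymptotic series (the standard library has none), and uses hypothetical function and argument names. Per-component means and variances are passed as lists of lists.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:            # shift x upwards using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def gaussian_kl(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL between diagonal Gaussians, summed over the d dimensions."""
    return 0.5 * sum(
        (m - mu_p) ** 2 / var_p + v / var_p - 1.0 - math.log(v / var_p)
        for m, v in zip(mu_q, var_q)
    )

def kl_exact(alpha_q, alpha_p0, kappa, mu_q, var_q):
    """Equation 15: KL given the number of vectors kappa[i] per component."""
    a0 = sum(alpha_q)
    kl = math.lgamma(a0) - math.lgamma(alpha_p0)
    kl += sum(
        k * (math.lgamma(alpha_p0 * a / (a0 * k)) - math.lgamma(a / k))
        for a, k in zip(alpha_q, kappa)
    )
    kl += (a0 - alpha_p0) * (
        -digamma(a0)
        + sum(a / a0 * digamma(a / k) for a, k in zip(alpha_q, kappa))
    )
    kl += sum(k * gaussian_kl(m, v) for k, m, v in zip(kappa, mu_q, var_q))
    return kl

def kl_approx(alpha_q, alpha_p0, kappa0, mu_q, var_q):
    """Equation 16: KL with kappa_i replaced by kappa0 * alpha_i / alpha_0."""
    a0 = sum(alpha_q)
    kl = math.lgamma(a0) - math.lgamma(alpha_p0)
    kl += (a0 - alpha_p0) * (digamma(a0 / kappa0) - digamma(a0))
    kl += kappa0 * (math.lgamma(alpha_p0 / kappa0) - math.lgamma(a0 / kappa0))
    kl += kappa0 * sum(
        a / a0 * gaussian_kl(m, v) for a, m, v in zip(alpha_q, mu_q, var_q)
    )
    return kl
```

When `kappa[i]` equals `kappa0 * alpha_q[i] / sum(alpha_q)` exactly, the two functions agree up to floating point error, and both vanish when the posterior matches the prior (equal total pseudo-count, zero means and unit variances).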

J REPARAMETERISATION TRICK AND SAMPLING

In this section we consider the reparameterisation trick to allow backpropagation through the sampling step. We consider the component Gaussians of the base distribution and the weights generated by the DP separately.

J.1 SAMPLING VECTORS FROM A COMPONENT OF THE BASE DISTRIBUTION

Each vector $z_k$ is sampled independently from some specific component $G^q_i$ of the base distribution $G^q_0$. Since we assume that all these components are distributed according to a Gaussian $G^q_i = \mathcal{N}(\mu^q_i, I(\sigma^q_i)^2)$, we can sample from this distribution using location-scale shifting (Kingma & Welling, 2014):

$$
z_k = \mu^q_i + \sigma^q_i \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, 1) \tag{17}
$$

Since the random sampling comes from an unparameterised unit Gaussian, there is no need to backpropagate error into this sampling step, but the error can be backpropagated into $\mu^q_i$ and $\sigma^q_i$ given a specific sample. This is the reparameterisation trick for Gaussian distributions.
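Equation 17 can be sketched in a few lines of plain Python. This is an illustration only: the function name is hypothetical, the vectors are lists with a diagonal standard deviation, and no automatic differentiation is involved. The noise is returned alongside the sample to make explicit that, for a fixed draw, the sample is a deterministic function of the mean and standard deviation.

```python
import random

def sample_component(mu, sigma, rng):
    """Location-scale sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    Returns the sample and the noise. Given eps, dz/dmu = 1 and
    dz/dsigma = eps, which is what allows gradients to pass through.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

if __name__ == "__main__":
    rng = random.Random(0)
    z, eps = sample_component([1.0, -2.0], [0.5, 2.0], rng)
    print(z, eps)
```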

J.2 SAMPLING FROM A DIRICHLET DISTRIBUTION

A Dirichlet distribution over category weights can be sampled by sampling from a Gamma distribution for each category and then normalising. The sum-normalised random variables $\pi_1, \ldots, \pi_\kappa$ follow a Dirichlet distribution $\mathrm{Dir}(\alpha_1, \ldots, \alpha_\kappa)$ if the unnormalised random variables $\gamma_i$ independently follow Gamma distributions:

$$
\pi_i = \frac{\gamma_i}{\sum_{i'=1}^{\kappa} \gamma_{i'}}, \qquad \gamma_i \sim \Gamma(\alpha_i, \beta = 1)
$$

where $\Gamma(\alpha_i, \beta = 1)$ is the Gamma distribution with $\beta = 1$, whose PDF is $f(x) = \frac{1}{\Gamma(\alpha_i)} \exp((\alpha_i - 1)\log(x) - x)$. There is no closed-form explicit reparameterisation trick for the Gamma distribution, but there is one for some of its approximations. We propose to use a combination of two approximations for the Gamma distribution which have a reparameterisation trick, one for small values of $\alpha_i$ and one for larger values of $\alpha_i$.⁵

Inverse CDF approximation to Gamma distributions. The Gamma distribution cannot use location-scale shifting for sampling due to its asymmetry, nor can the parameters and noise components be decoupled in the inverse CDF. Hence, Knowles (2015) suggests sampling using an approximation to the inverse CDF of the Gamma distribution of the following form:

$$
\gamma_i \approx \beta^{-1} \big(u_i\, \alpha_i\, \Gamma(\alpha_i)\big)^{1/\alpha_i}, \qquad u_i \sim U(0, 1).
$$

This approximation allows the inverse CDF of the Gamma distribution to be a function of the parameters and independent noise from a uniform distribution $U(0, 1)$. However, this approximation is only recommended when $\alpha_i < 1$ and $\beta = 1$. In our case $\beta = 1$ but we sometimes have large $\alpha_i$.

Gaussian approximation to Gamma distributions. Knowles (2015) further mentions that the Gaussian distribution can be used to approximate a Gamma distribution for larger $\alpha$. Bahuleyan et al. (2018) use a similar approach to approximate variational attention weights. The Gaussian distribution is a symmetric distribution which can be sampled by location-scale shifting, as discussed above. The Gamma distribution $\Gamma(\alpha, \beta = 1)$ can be approximated with a Gaussian with mean $\alpha$ and standard deviation $\sqrt{\alpha}$, written $\gamma \sim \mathcal{N}(\alpha, \sqrt{\alpha})$ (Knowles, 2015), which gives us:

$$
\gamma_i \approx \alpha_i + \sqrt{\alpha_i}\, \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1) \tag{20}
$$

The Gaussian distribution is symmetric and can take on negative values. Hence, this approximation is inappropriate for the Gamma distribution unless the $\alpha_i$ parameter is sufficiently large. Otherwise, the sample will need to be truncated to a value greater than zero.
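The Gamma-normalisation construction at the start of this section can be illustrated with exact Gamma sampling from Python's standard library. Note that this sketch uses `random.gammavariate`, which is not reparameterised, so it only demonstrates the sampling identity and not the gradient path discussed above. The function names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Sample Dir(alphas) by normalising independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def empirical_mean(alphas, num_samples, seed=0):
    """Average many Dirichlet samples; should approach alpha_i / sum(alphas)."""
    rng = random.Random(seed)
    sums = [0.0] * len(alphas)
    for _ in range(num_samples):
        for i, w in enumerate(sample_dirichlet(alphas, rng)):
            sums[i] += w
    return [s / num_samples for s in sums]

if __name__ == "__main__":
    print(empirical_mean([0.5, 1.5, 3.0], 4000))
```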



The code is available at https://github.com/idiap/nvib and https://github.com/idiap/nvib_transformers.

$\sqrt{d}\,\|z_i\|^2$. In the rest of this paper, we will use $\mathrm{DAttn}(u; F)$ as our definition of attention, which allows us to treat the latent space of a Transformer encoder-decoder as mixture distributions.

We will use "p" and "q" superscripts to designate variables for the prior and posterior, respectively. Similarly, a zero subscript is part of the name of the variable, in contrast to positive integer subscripts which are indices.

F-PPL is calculated across interpolations using a Transformer language model trained on Wikitext-103 at the same scale as the larger models.

We remove any collapses to exactly S1 or S2 as this will bias the F-PPL metric favourably.

There may be other constructions which could equally well be used to implement $\mathrm{Attn}(u, Z)$ in terms of $\mathrm{DAttn}(u; F)$, but all that is important here is the existence of one. Equally, we do not intend to claim that all $\mathrm{DAttn}(u; F)$ functions can be implemented in terms of $\mathrm{Attn}(u, Z)$. Indeed, the greater generality of $\mathrm{DAttn}(u; F)$ is crucial in this work.

We leave the investigation of implicit reparameterisation gradients (Figurnov et al., 2018) to future work. This approach is an alternative to explicit reparameterisation for cases like Gamma distributions, but first we investigate the approach which is more standard in the VAE literature.



Figure 1: (a) Illustration of the NVAE model, with its NVIB layer. (b) Query denoising attention at training time, with the sampled distribution as the query prior, a noisy query observation, and the expected value of the denoised query. (c) Query denoising attention at test time using the mean distribution.

Figure 2: Reconstruction versus generation trade-off (a), (b) and regularisation analysis (c), (d). Panels: (a) F-PPL vs test BLEU; (b) R-PPL vs test BLEU; (c) test BLEU vs proportion of retained vectors ν; (d) latent quantity of vectors vs input tokens.

Figure 3: BLEU with S1 versus S2 for varying interpolations τ.

Figure 4: Reconstruction versus generation trade-off and regularisation for Wikitext-2 validation.

Figure 5: Conditional prior $\alpha^\Delta$ controlling the proportion of retained vectors ν.

Figure 6: Effect of the number of samples $\kappa^\Delta$ on the reconstruction and generation trade-off.

Figure 7: Latent vector alignment with generated output.

Model              τ      Sentence
                          ancient graffiti and possesses low windows.
VT                 0.25   the palace has ancient graffiti and possesses low windows.
VT                 0.5    the palace a modern class and enthusiastically and cinema.
VT                 0.75   smoke signals a history of native americans in cinema.
VTP (max)          0.25   the palace has ancient graffiti and possesses low windows.
VTP (max)          0.5    the soul room a ancient and republicansium in downtown.
VTP (max)          0.75   smoke signals a history of native americans in cinema.
VTS (S = 0.8)      0.25   the palace has ancient graffiti and possesses low windows.
VTS (S = 0.8)      0.5    smoke miners has the graffiti native villagers low schools.
VTS (S = 0.8)      0.75   smoke signals a history of native americans in cinema.
NVAE (α∆ = 0.4)    0.25   the palace has ancient economics and larger war butterfly.
NVAE (α∆ = 0.4)    0.5    smokeide s historical gifts of german language in 29.
NVAE (α∆ = 0.4)    0.75   smokepers is judicial topics of german language cinema.
S2                 1      smoke signals a history of native americans in cinema.

These findings are qualitatively confirmed through different examples (Appendix E).

The proportion of interpolations different from S1 and S2 by varying the interpolation rate τ.

Fluency metric F-PPL of interpolations when τ = 0.5.



Dataset statistics. Number of sentence examples in the train, validation and test sets. Number of word piece tokens per sentence example.



Large scale model results for regularisation and generation on test Wikitext-103.

• video game,s's reprinted le nes 2010 allowing track video game, browns passengers for the third race, [UNK] he, and
• tropical confronted were were level prime move criminal discussed color topical liberty camp dated confronted an so so topical series camph created the located better move topical replacement from confronted disrupted destiny newmarket thrust the nine confronted confronted destiny camp topical topical controlled future great future camp near
• runway a game capacity list to him a a, attitudes at forwards pageant grand grand, flash bugs forwards at during made winds. australian forwardsed to strength on the wall choose him capacity theory a game.

VTP
• vocals responded off down in their episodes and rachel originally a sequel the simpsons [UNK] in the second episode of these episodes [UNK]

The first 4 samples drawn from the large scale models trained on the Wikitext-103 dataset.

Interpolation results by varying τ using sentences with a different number of input tokens.

Model              τ      Sentence
                          the time, the [UNK] saw the source and the [UNK] envoys still refused for celebration.
                   0.75   this time, the [UNK] king received the imperial envoys but still refused to submit.
VTS (S = 0.8)      0.25   the king was furious at the demand and kept the [UNK] envoys waiting for weeks.
VTS (S = 0.8)      0.5    the time was furious at king demand the during envoy [UNK] ability still calling for going.
VTS (S = 0.8)      0.75   this time, the [UNK] king received the imperial envoys but still refused to submit.
NVAE (α∆ = 0.4)    0.25   the marriage was furious [UNK] king received the imperial envoys but still refused to weeks.
NVAE (α∆ = 0.4)    0.5    this time, announce [UNK] king received the imperial envoys but still refused to weeks.
NVAE (α∆ = 0.4)    0.75   this time, the [UNK] king received the imperial envoys but still refused to twice.
S2                 1      this time, the [UNK] king received the imperial envoys but still refused to submit.

Interpolation results by varying τ using sentences with the same number of input tokens.

ACKNOWLEDGEMENTS

We would like to thank Florian Mai, Andrei Catalin Coman, Melika Behjati, Andreas Marfurt, and other members of the Idiap NLU group for helpful comments and suggestions. This work was funded in part by the Swiss National Science Foundation under the NCCR grant Evolving Language, Swiss National Science Foundation Agreement #51NF40 180888.

REPRODUCIBILITY STATEMENT

We faithfully describe the details of our method in the text and provide detailed derivations for theoretical grounding (Sections F, G, H, I). We mention all relevant hardware, hyperparameters and datasets needed to reproduce our experiments (Section A). We provide variance estimations across seeds and ablations to justify design choices (Section B). Finally, we provide exact equations for denoising attention to make implementation easier (Section K). The open source code for this research has been released at https://github.com/idiap/nvib for the NVIB layer in PyTorch and https://github.com/idiap/nvib_transformers for the experiments.

AUTHOR CONTRIBUTIONS

James Henderson is responsible for the high-level vision and the theoretical derivations and proofs related to Bayesian nonparametrics. Fabio Fehr is responsible for the sampling method, evaluations, implementations and running of experiments.

Published as a conference paper at ICLR 2023

The combined reparameterisation of Gamma distributions. To visualise the error for these two approximations, their average $L_1$ distance from the true Gamma inverse CDF is plotted in Figure 9 for different values of $\alpha$. The plot shows that the approximation error is equal when $\alpha = 0.6363$. To take advantage of the strengths of both these approximations, we propose to reparameterise the Gamma distribution as a blend of these two approximations with a switch at $\alpha = 0.6363$ and truncate negative samples to zero.
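A minimal sketch of this combined reparameterisation is given below, in plain Python rather than the released PyTorch code. It assumes that the blend is a hard switch (inverse CDF approximation below $\alpha = 0.6363$, Gaussian approximation at or above it), and it floors samples at a small positive constant `TINY` instead of exactly zero so that the subsequent normalisation is always defined. Both choices, and all names, are assumptions made for illustration.

```python
import math
import random

SWITCH_ALPHA = 0.6363  # point at which the two approximation errors are equal
TINY = 1e-12           # assumed positive floor used when truncating samples

def gamma_inverse_cdf_approx(alpha, u):
    """Approximate Gamma(alpha, 1) sample: (u * alpha * Gamma(alpha))^(1/alpha).

    Computed in log space using alpha * Gamma(alpha) = Gamma(alpha + 1).
    Requires u in (0, 1].
    """
    return math.exp((math.log(u) + math.lgamma(alpha + 1.0)) / alpha)

def gamma_gaussian_approx(alpha, eps):
    """Approximate Gamma(alpha, 1) sample: alpha + sqrt(alpha) * eps."""
    return alpha + math.sqrt(alpha) * eps

def reparam_gamma(alpha, u, eps):
    """Switch between the two approximations and truncate non-positive values."""
    if alpha < SWITCH_ALPHA:
        sample = gamma_inverse_cdf_approx(alpha, u)
    else:
        sample = gamma_gaussian_approx(alpha, eps)
    return max(sample, TINY)

def sample_dirichlet_weights(alphas, rng):
    """Approximate Dirichlet sample obtained by normalising the Gamma samples."""
    gammas = [
        reparam_gamma(a, 1.0 - rng.random(), rng.gauss(0.0, 1.0))
        for a in alphas
    ]
    total = sum(gammas)
    return [g / total for g in gammas]

if __name__ == "__main__":
    print(sample_dirichlet_weights([0.1, 0.5, 2.0, 8.0], random.Random(0)))
```

In both branches the sample is a deterministic function of $\alpha$ and parameter-free noise (`u` or `eps`), which is the property that allows gradients to reach $\alpha$ in an automatic differentiation framework.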

K PRACTICAL IMPLEMENTATION OF DENOISING ATTENTION

In this section we provide the equations used to allow denoising attention to be implemented in a deep learning framework at training time and test time.

Denoising attention at training time. During training, the set of vectors $Z \in \mathbb{R}^{n \times p}$ and their log-probability values $\log(\pi) \in \mathbb{R}^{1 \times n}$ are both sampled and output by the NVIB layer, thereby specifying the sampled mixture distribution $F$. For each use of denoising attention, the query $u' \in \mathbb{R}^{1 \times p}$ is projected by the grouped matrices. The keys' dimensionality $d$ is used for scaling. Denoising attention can then be computed as:

We define this for $u \in \mathbb{R}^{1 \times p}$, but this can easily be extended to multiple queries.

Denoising attention at test time. During test time, we do not sample $F$, but instead use the mean of the posterior distribution, which is its base distribution $G^q_0$. The NVIB layer takes its input from the encoder and maps it to the parameters $(\mu^q, \sigma^q, \frac{\alpha^q}{\alpha^q_0})$ of this base distribution $G^q_0 = \sum_i \frac{\alpha^q_i}{\alpha^q_0} \mathcal{N}(\mu^q_i, I(\sigma^q_i)^2)$. For convenience let $(\sigma^r_i)^2 = \sqrt{d} + (\sigma^q_i)^2$. Test-time denoising attention can then be computed as:

where $1_p$ is a row vector of $p$ ones.

A caveat of this derivation is that it applies only to single-head attention and is not trivial to extend to multi-head attention. We leave this for future work.

