ARIEL: VOLUME CODING FOR SENTENCE GENERATION COMPARISONS

Anonymous authors
Paper under double-blind review

Abstract

Mapping sequences of discrete data to a point in a continuous space makes it difficult to retrieve those sequences via random sampling. Mapping the input to a volume would make retrieval easier at test time, and that is the strategy followed by the family of approaches based on the Variational Autoencoder. However, because these approaches optimize simultaneously for prediction and for smoothness of representation, they are forced to trade one off against the other. We benchmark the performance of several standard deep learning methods at generating sentences by uniformly sampling a continuous space. We do so by proposing AriEL, which constructs volumes in a continuous space without needing to encourage the creation of volumes through the loss function. We first benchmark on a toy grammar, which allows us to automatically evaluate the language learned and generated by the models. Then, we benchmark on a real dataset of human dialogues. Our results indicate that random access to the stored information can be significantly improved, since our method AriEL generates a wider variety of correct language by randomly sampling the latent space. The VAE follows in performance on the toy dataset, while the AE and the Transformer follow on the real dataset. This partially supports the hypothesis that encoding information into volumes instead of points leads to improved retrieval of learned information with random sampling. We hope this analysis can clarify directions towards better generators.

1. INTRODUCTION

It is standard for neural networks to map an input to a point in a d-dimensional real space (Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017; LeCun et al., 1989). However, that makes it difficult to find a specific point when the space is sampled randomly, which can limit the applicability of pre-trained models beyond their initial scope. Some approaches do map an input to volumes in the latent space. The family of approaches that stems from the idea of the Variational Autoencoder (Kingma and Welling, 2014; Bowman et al., 2016; Rezende and Mohamed, 2015; Chen et al., 2018) is trained to encourage such representations: by encoding an input into a probability distribution that is sampled before decoding, several neighbouring points in R^d can end up representing the same input. However, this often implies having two summands in the loss, a log-prior term and a log-likelihood term (Kingma and Welling, 2014; Bowman et al., 2016), that pull in different directions. If we want a smooth, volumetric representation, encouraged by the log-prior, it may come at the cost of worse reconstruction or classification, encouraged by the log-likelihood; each term therefore diminishes the strength and influence of the other. By partially giving up on the smoothness of the representation, we instead propose a method that explicitly constructs volumes, without a loss that implicitly encourages such behavior. We propose AriEL, a method that maps sentences to volumes in R^d for efficient retrieval with either random sampling or a network that operates in its continuous space. It draws inspiration from arithmetic coding (AC) (Elias and Abramson, 1963) and k-d trees (KdT) (Bentley, 1975), and we name it after them: Arithmetic coding and k-d trEes for Language (AriEL). For simplicity we choose to focus on language, even though the technique is applicable to the coding of any variable-length sequence of discrete symbols.
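As a point of reference, the arithmetic-coding idea that AriEL builds on can be sketched in a few lines: a sequence of symbols is narrowed down to a sub-interval of [0, 1), and any point inside that interval decodes back to the same sequence. This is a minimal illustrative sketch, not the paper's implementation; the fixed symbol probabilities are an assumption made for the example.

```python
def encode(sequence, probs):
    """Narrow [0, 1) once per symbol; sub-interval widths follow `probs`."""
    low, high = 0.0, 1.0
    for symbol in sequence:
        width = high - low
        cum = 0.0
        for s, p in probs.items():
            if s == symbol:
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
    return low, high

def decode(point, length, probs):
    """Map any point of the interval back to the length-`length` sequence."""
    sequence = []
    low, high = 0.0, 1.0
    for _ in range(length):
        width = high - low
        cum = 0.0
        for s, p in probs.items():
            lo, hi = low + cum * width, low + (cum + p) * width
            if lo <= point < hi:
                sequence.append(s)
                low, high = lo, hi
                break
            cum += p
    return sequence

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
low, high = encode(["b", "a", "c"], probs)
# every point of [low, high) decodes back to the same sequence
assert decode((low + high) / 2, 3, probs) == ["b", "a", "c"]
```

Since the interval's length is the product of the symbol probabilities, frequent sentences occupy larger regions, which is exactly what makes them easier to hit by random sampling.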
More precisely, we plan to use AriEL in the context of dialogue systems, with the goal of providing a tool to optimize interactive agents. The interaction of AriEL with longer text is left as future work. In particular, AriEL can be used as an objective benchmark to compare other methods and to understand how their latent space is used. AriEL attempts to completely fill the latent space with the sentences present in the training dataset, using notions from information theory. AriEL uses a language model to split the latent space into volumes, guided by the probability assigned to the next symbol in a sentence. For this reason, it can also simplify the reuse of pretrained language models for new tasks and in larger architectures. In fact, it can provide a training agent with a simpler interface to a language model, e.g. GPT-2 (Radford et al., 2019), where the agent could choose the optimal dimensionality of the interface. We show how such a volume representation eases the retrieval of stored learned patterns and how to use it to set references for other models. Our contributions are therefore:
• AriEL, a volume coding technique based on arithmetic coding and k-d trees (Section 3.1), to improve the retrieval of learned patterns with random sampling;
• the use of a context-free grammar and a random bias in the dataset (Section 3.3), which allows us to automatically quantify the quality of the generated language;
• the notion that explicit volume coding (Sections 2 and 5) can be a useful technique in tasks that involve the generation of sequences of discrete symbols, such as sentences.
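The latent-space splitting described above can be sketched as follows: each next symbol narrows one coordinate of an axis-aligned box inside the unit hypercube, cycling through the d dimensions in k-d tree fashion, with sub-interval widths given by the next-symbol probabilities of a language model. The toy language model and all names below are illustrative assumptions, not the paper's implementation.

```python
def next_symbol_probs(prefix):
    # Toy stand-in for a learned language model; a real model would
    # condition these probabilities on the prefix.
    return {"the": 0.5, "dog": 0.3, "runs": 0.2}

def sentence_to_box(sentence, d=2):
    """Map a sentence to an axis-aligned box in [0, 1]^d."""
    box = [(0.0, 1.0)] * d              # start from the full unit hypercube
    for i, symbol in enumerate(sentence):
        axis = i % d                    # k-d tree style: cycle through axes
        low, high = box[axis]
        width = high - low
        cum = 0.0
        for s, p in next_symbol_probs(sentence[:i]).items():
            if s == symbol:
                box[axis] = (low + cum * width, low + (cum + p) * width)
                break
            cum += p
    return box

box = sentence_to_box(["the", "dog", "runs"], d=2)
# axis 0 is narrowed by "the" and then "runs"; axis 1 by "dog"
```

Under this construction every sentence owns a box whose volume is the product of its next-symbol probabilities, so a uniform sample of the hypercube lands in some sentence's box with probability proportional to the language model's score for it.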

2. RELATED WORK

Volume codes: We define a volume code as a pair of functions, an encoder and a decoder, where the encoder maps an input x into a set made of compact and connected subsets of R^d (Munkres, 2018), and the decoder maps every point within that set back to x. It is a form of distributed representation (Hinton et al., 1984), since the latter only assumes that the input x is represented as a point in R^d; we define point codes as the distributed representations that are not volume codes. Volume codes also differ from coarse coding (Hinton et al., 1984), where the code is a list of zeros and ones identifying which overlapping sets x falls into. We call a volume code implicit when it is encouraged through a term in the loss function (Bengio et al., 2013); both generative and discriminative models (Ng and Jordan, 2002; Kingma and Welling, 2014; Jebara, 2012) can learn volume codes this way. We call a volume code explicit when the volumes are instead constructed through the operations that define the architecture, independently of any choice of loss and optimizer.

Sentence generation through random sampling: Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) map random samples to a learned generation through a 2-player game training procedure. They have had trouble with text generation, due to the non-differentiability of the argmax at the end of the generator, and because partially generated sequences are non-trivial to score (Yu et al., 2017). Several advances have significantly improved their performance for text generation, such as using the generator as a reinforcement learning agent trained through Policy Gradient (Yu et al., 2017), replacing the binary classification with a cross-entropy discriminator that evaluates each generated word (Xu et al., 2018), or using the Gumbel-Softmax distribution (Kusner and Hernández-Lobato, 2016).
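A minimal, hypothetical one-dimensional illustration of the definition above: the encoder maps each input to a sub-interval of [0, 1] (a compact, connected set), and the decoder maps every point of that interval back to the same input, which is the defining property of a volume code. The vocabulary and interval layout are assumptions made purely for the example.

```python
inputs = ["cat", "dog", "bird"]

def encode(x):
    """Map input x to a sub-interval of [0, 1]: one disjoint interval per input."""
    i = inputs.index(x)
    return (i / len(inputs), (i + 1) / len(inputs))

def decode(z):
    """Map ANY point of an input's interval back to that input."""
    return inputs[min(int(z * len(inputs)), len(inputs) - 1)]

low, high = encode("dog")
for t in (0.01, 0.25, 0.5, 0.75, 0.99):      # interior points of the volume
    assert decode(low + t * (high - low)) == "dog"
```

A point code, by contrast, would guarantee decodability only at a single point per input, so a random sample would almost surely miss every stored input.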
Random sampling of the latent space is also used by Variational Autoencoders (VAE) (Kingma and Welling, 2014), to smooth the representation of the learned patterns. Training a VAE for text has been shown to be possible with KL annealing and word dropout (Bowman et al., 2016), and made easier with convolutional decoders (Severyn et al., 2017; Yang et al., 2017). Several works explore how VAE and GAN can be combined (Makhzani et al., 2015; Tolstikhin et al., 2017; Mescheder et al., 2017). AriEL can be used as a generator or a discriminator in a GAN, or as an encoder or a decoder in an autoencoder. However, it differs from them in its explicit procedure for constructing volumes in the latent space that correspond to different inputs: the intention is to fill the entire latent space with the learned patterns, to ease retrieval by uniform random sampling.

Arithmetic coding and neural networks: AC is one of the most efficient lossless data compression techniques (Witten et al., 1987; Elias and Abramson, 1963). AC assigns a sequence to a segment of [0,1] whose length is proportional to the sequence's frequency in the dataset. AC has been used for neural network compression (Wiedemann et al., 2019), but typically neural networks are used within AC as the model of the data distribution, to perform prediction-based compression (Pasero and Montuori, 2003; Triantafyllidis and Strintzis, 2002; Jiang et al., 1993; Ma et al., 2019; Tatwawadi, 2018). We turn AC

