UNSUPERVISED DISCOVERY OF INTERPRETABLE LATENT MANIPULATIONS IN LANGUAGE VAES

Abstract

Language generation models are attracting more and more attention due to their constantly increasing quality and remarkable generation results. State-of-the-art NLG models such as BART, T5, and GPT-3 do not have latent spaces, so there is no natural way to perform controlled generation. In contrast, less popular models with explicit latent spaces have the innate ability to manipulate text attributes by moving along latent directions. For images, the properties of latent spaces are well-studied: there exist interpretable directions (e.g. zooming, aging, background removal), and they can even be found without supervision. This success is expected: latent space image models, especially GANs, achieve state-of-the-art generation results and hence have been the focus of the research community. For language, this is not the case: text GANs are hard to train because of non-differentiable discrete data generation, and language VAEs suffer from posterior collapse and fill the latent space poorly. This makes finding interpretable text controls challenging. In this work, we make the first step towards unsupervised discovery of interpretable directions in language latent spaces. For this, we turn to methods shown to work in the image domain. Surprisingly, we find that running PCA on VAE representations of the training data consistently outperforms shifts along coordinate and random directions. This approach is simple, data-adaptive, requires no training, and discovers meaningful directions, e.g. sentence length, subject age, and verb tense. Our work lays foundations for two important areas: first, it makes it possible to compare models in terms of latent space interpretability, and second, it provides a baseline for unsupervised discovery of latent controls.

1. INTRODUCTION

Transformer-based models yield state-of-the-art results on a number of tasks, including representation learning (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) and generation (Radford et al.; Raffel et al., 2019; Lewis et al., 2020). Notably, large language models have been reported to produce outputs nearly indistinguishable from human-written texts (Brown et al., 2020). Although the predictions of autoregressive language models are fluent and coherent, it is not clear how to manipulate the model to get samples with desired properties: for example, how to make them shorter, more formal, or more positive, or, alternatively, how to use the same model to rewrite human-written texts in a different tone. Current approaches often rely on external labels of target attributes and require modifications to the model. This involves retraining for new attributes or changing the decoding procedure, which is usually expensive. In contrast, models with explicit latent spaces have the innate ability to manipulate text attributes by moving along latent directions. They have, however, gained limited traction. One reason is that training a VAE on text data poses a number of optimization challenges, which have been tackled with varying degrees of success (He et al., 2019; Fu et al., 2019; Zhu et al., 2020). Additionally, language VAEs are mostly small LSTM-based models, which goes against the current trend of using large pretrained Transformers. The first large-scale language VAE is the recently introduced OPTIMUS (Li et al., 2020): it uses BERT as the encoder and GPT-2 as the decoder, and sets a new record on benchmark datasets. Differently from text, latent space models for images, especially GANs, achieve state-of-the-art generation results. Therefore, these models have been the focus of the research community, and the properties of their latent spaces are well-studied.
For example, even early works on generative adversarial networks for images report that it is possible to obtain smooth interpolations between images in the latent space (Goodfellow et al., 2014). More recent studies show that latent space directions corresponding to human-interpretable image transformations (from now on, "interpretable directions") can be discovered in an unsupervised way (Härkönen et al., 2020; Voynov & Babenko, 2020; Peebles et al., 2020). In this paper, we show that for the language domain, much like the well-studied visual domain, a sufficiently "good" latent space allows manipulating sample attributes with relative ease. To avoid the known difficulties associated with training language GANs, we experiment with VAEs; more specifically, with the current state-of-the-art model OPTIMUS. We show that for this model, not only is it possible to produce meaningful and "smooth" interpolations between examples and to transfer specific properties via arithmetic operations in the latent space, but it is also possible to discover interpretable latent directions in an unsupervised manner. We propose a method based on PCA of the latent representations of the texts in the training dataset. According to human evaluation, the proportion of interpretable directions among those found by our method is consistently larger than the proportion of interpretable directions among canonical coordinates or random directions in the latent space. The meaningful directions found by this method include, for example, subject age, subject gender, verb tense, and sentence length. Some of the directions, e.g. sentence length, are potentially useful: the ability to expand or shrink a text while preserving its content may help in tasks such as summarization. Note that the proposed method is simple and fast. It is simple because it requires only a forward pass of the encoder, without backpropagating through decoding steps.
This is very important for the language domain, where backpropagation through samples is significantly more difficult than for images: generation is non-differentiable, and previous attempts to overcome this issue relied on noisy or biased gradient estimates, which are less reliable than standard MLE training. Instead, we do not rely on generated samples at all: we operate directly in the latent space. Additionally, since sampling directly from the prior does not yield diverse samples in the case of OPTIMUS, we use the representations of the training data without running a decoding procedure; this makes the method fast. To summarize, our contributions are as follows:

1. We propose the first method for unsupervised discovery of interpretable directions in latent spaces of language VAEs.

2. The method is simple and fast: it is based on PCA of the latent representations of the texts in the training dataset.

3. The method is effective: the proportion of interpretable directions among those found by our method is consistently larger than that of canonical coordinates or random directions in the latent space.

4. Our work lays foundations for two important areas: first, it makes it possible to compare models in terms of latent space interpretability, and second, it provides a baseline for unsupervised discovery of latent controls.
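To make the procedure concrete, the following is a minimal sketch of the PCA step and of a shift along a discovered direction. The latent codes here are synthetic vectors standing in for encoder outputs of a language VAE such as OPTIMUS; the latent dimensionality, the step size `alpha`, and the final decoding step are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

def pca_directions(latents):
    """Principal directions of a set of latent codes.

    latents: (n_samples, latent_dim) array of encoder outputs.
    Returns a (latent_dim, latent_dim) matrix whose rows are unit-norm
    principal directions sorted by decreasing explained variance.
    """
    centered = latents - latents.mean(axis=0, keepdims=True)
    # SVD of the centered data matrix: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt

def shift_latent(z, direction, alpha):
    """Move a latent code z along a unit-norm direction by step alpha."""
    return z + alpha * direction

# Synthetic stand-in for latent codes of the training texts
# (in the real pipeline these come from a single encoder forward pass).
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 32))

dirs = pca_directions(latents)

# Shift one example along the top principal direction; decoding z_shifted
# would then produce the manipulated text.
z = latents[0]
z_shifted = shift_latent(z, dirs[0], alpha=3.0)
```

Note that, as described above, no decoding or backpropagation is needed to find the directions themselves: the SVD runs once over the encoded training set, and shifted codes are decoded only when generating manipulated texts.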

2. RELATED WORK

Finding interpretable directions in latent spaces of language VAEs is related to three lines of work. First, latent variable models for text and, more specifically, the properties of their latent spaces: for interpretable directions to exist, the latent space has to be smooth (i.e. allow coherent interpolations). Then, since a great part of the motivation for finding interpretable directions is manipulating generated texts, we discuss works on controllable text generation for different types of models, both VAE-based and standard autoregressive. Finally, we mention recent works that try to discover interpretable directions in image GANs.

2.1. LATENT VARIABLE MODELS FOR TEXT

Latent variable models encode information about text into a probability distribution. In addition to sampling new sentences from the prior distribution, they potentially allow to explicitly encode

