UNSUPERVISED DISCOVERY OF INTERPRETABLE LATENT MANIPULATIONS IN LANGUAGE VAES

Abstract

Language generation models are attracting increasing attention due to their constantly improving quality and remarkable generation results. State-of-the-art NLG models such as BART, T5, and GPT-3 do not have latent spaces, so there is no natural way to perform controlled generation. In contrast, less popular models with explicit latent spaces have the innate ability to manipulate text attributes by moving along latent directions. For images, the properties of latent spaces are well-studied: interpretable directions exist (e.g., zooming, aging, background removal), and they can even be found without supervision. This success is expected: latent-space image models, especially GANs, achieve state-of-the-art generation results and have therefore been the focus of the research community. For language, this is not the case: text GANs are hard to train because discrete data generation is non-differentiable, and language VAEs suffer from posterior collapse and fill the latent space poorly. This makes finding interpretable text controls challenging. In this work, we make a first step towards unsupervised discovery of interpretable directions in language latent spaces. To this end, we turn to methods shown to work in the image domain. Surprisingly, we find that running PCA on the VAE representations of training data consistently outperforms shifts along coordinate and random directions. This approach is simple, data-adaptive, requires no training, and discovers meaningful directions, e.g., sentence length, subject age, and verb tense. Our work lays the foundation for two important areas: first, it allows comparing models in terms of latent space interpretability; second, it provides a baseline for unsupervised discovery of latent controls.
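The PCA-based procedure described above can be summarized in a few lines: encode the training sentences into latent vectors, compute the principal directions of those vectors, and shift a latent code along a chosen direction before decoding it. The following is a minimal sketch, not the authors' implementation; the encoder/decoder are stand-ins, and the helper names (`pca_directions`, `shift_latent`) are illustrative.

```python
import numpy as np

def pca_directions(latents, k=10):
    """Return the top-k principal directions of a matrix of latent codes.

    latents: array of shape (n_samples, latent_dim), e.g. VAE encodings
    of training sentences.
    """
    centered = latents - latents.mean(axis=0)
    # SVD of the centered data: rows of vt are unit-norm principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def shift_latent(z, direction, alpha):
    """Move a latent code z along a direction by magnitude alpha."""
    d = direction / np.linalg.norm(direction)
    return z + alpha * d

# Usage with random stand-in latents (a real setup would use a VAE encoder
# to produce `latents` and a decoder to generate text from `z_shifted`).
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 32))
directions = pca_directions(latents, k=5)
z_shifted = shift_latent(latents[0], directions[0], alpha=2.0)
```

Shifting along the top principal components is what the paper compares against shifts along coordinate axes and random directions; the decoded text from `z_shifted` is then inspected for an interpretable change (e.g., sentence length).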

1. INTRODUCTION

Transformer-based models yield state-of-the-art results on a number of tasks, including representation learning (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) and generation (Radford et al.; Raffel et al., 2019; Lewis et al., 2020). Notably, large language models have been reported to produce outputs nearly indistinguishable from human-written texts (Brown et al., 2020). Although the predictions of autoregressive language models are fluent and coherent, it is not clear how to manipulate such a model to obtain samples with desired properties: for example, making them shorter, more formal, or more positive, or using the same model to rewrite human-written texts in a different tone. Current approaches often rely on external labels of target attributes and require modifications to the model; this involves retraining for new attributes or changing the decoding procedure, which is usually expensive. In contrast, models with explicit latent spaces have the innate ability to manipulate text attributes by moving along latent directions. They have, however, gained limited traction. One reason is that training a VAE on text data poses a number of optimization challenges, which have been tackled with varying degrees of success (He et al., 2019; Fu et al., 2019; Zhu et al., 2020). Additionally, language VAEs are mostly small LSTM-based models, which goes against the current trend of using large pretrained Transformers. The first large-scale language VAE is the recently introduced OPTIMUS (Li et al., 2020): it uses BERT as the encoder and GPT-2 as the decoder, and sets a new record on benchmark datasets. Differently from texts, latent-space models for images, especially GANs, achieve state-of-the-art generation results. Therefore, these models have been the focus of the research community, and the

