TEXTSETTR: LABEL-FREE TEXT STYLE EXTRACTION AND TUNABLE TARGETED RESTYLING

Abstract

We present a novel approach to the problem of text style transfer. Unlike previous approaches that use parallel or non-parallel labeled data, our technique removes the need for labels entirely, relying instead on the implicit connection in style between adjacent sentences in unlabeled text. We show that T5 (Raffel et al., 2020), a strong pretrained text-to-text model, can be adapted to extract a style vector from arbitrary text and use this vector to condition the decoder to perform style transfer. As the resulting learned style vector space encodes many facets of textual style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input text while preserving others. When trained on unlabeled Amazon reviews data, our resulting TextSETTR model is competitive on sentiment transfer, even when given only four exemplars of each class. Furthermore, we demonstrate that a single model trained on unlabeled Common Crawl data is capable of transferring along multiple dimensions including dialect, emotiveness, formality, politeness, and sentiment.

1. INTRODUCTION

There has been a recent surge of interest in text style transfer, with the aim of training models able to modify specific attributes of input text (e.g., sentiment or formality) while preserving the remaining content. For example, a sentiment transfer model might transform the input "best book ever!" into "worst book ever!", while a formality transfer model might change the same input into "This is the best book I have ever read." Work in this area falls into three categories. Supervised approaches like Jhamtani et al. (2017) transfer between pre-selected styles, relying on aligned parallel training data to teach the model the desired input/output correspondence; this method is limited by the availability of parallel corpora. So-called "unsupervised" approaches like Li et al. (2018) and Lample et al. (2019) remove the requirement for parallel data, but still require labeled training examples of each style, and are limited to transfer between a pre-specified set of styles. Label-free approaches like the recent Xu et al. (2020) remove the need for any training labels. While this setting is the most technically challenging, it offers the potential to transfer between arbitrary styles at inference time, which is of significant value since curated datasets are unavailable for many style attributes.

In this work, we explore the hypothesis that large pretrained text-to-text models like T5 (Raffel et al., 2020) already contain a strong representation of textual style, which can be extracted and used to condition the decoder of a style transfer model through a relatively lightweight fine-tuning procedure. To isolate style information in the absence of labels, we rely on the observation that style is a "slow-moving" feature, which tends to be consistent over large spans of text.
Specifically, given two adjacent sentences from an unlabeled corpus, we train our model to extract a "style vector" from the first and use that vector to perform denoising and other reconstruction tasks on the second. This technique extends the unsupervised approach of Lample et al. (2019) to the label-free setting, and allows us to reformulate the style transfer operation as a directional operation in style vector space using the difference between target and source style vectors; we call this "targeted restyling". When combined with a novel "tunable inference" technique for controlling token add/delete rates, this gives our final model: Text Style Extraction and Tunable Targeted Restyling (TextSETTR). Our main contributions are to: (1) demonstrate the viability of label-free style transfer,¹ (2) use sentence adjacency as a means for inducing text style representations, (3) reframe style transfer as "targeted restyling" directional operations in style space, (4) introduce "tunable inference" for finer-grained control of transfers, (5) show the effectiveness of "noisy" back-translation training, and (6) illustrate few-shot generalization to a range of style attributes including dialect, emotiveness, formality, politeness, and sentiment.
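At inference time, the "targeted restyling" operation amounts to simple vector arithmetic: extract a style vector from the input, then add the scaled difference between the mean extracted styles of a few target-style and source-style exemplars. The sketch below is illustrative only; the function name `targeted_restyle`, the parameter `delta_scale`, and the toy extractor in the usage example are ours, not the paper's released implementation.

```python
import numpy as np

def targeted_restyle(extract, input_text, src_exemplars, tgt_exemplars,
                     delta_scale=1.0):
    """Form a new style vector for `input_text` by adding a directional
    delta: the mean extracted style of target-style exemplars minus that
    of source-style exemplars, scaled by `delta_scale` (illustrative
    sketch of the targeted-restyling operation described above)."""
    tgt_mean = np.mean([extract(t) for t in tgt_exemplars], axis=0)
    src_mean = np.mean([extract(t) for t in src_exemplars], axis=0)
    return extract(input_text) + delta_scale * (tgt_mean - src_mean)
```

In the actual model, `extract` is the trained style extractor producing a 1024-dimensional vector, and the resulting vector conditions the decoder to produce the restyled output.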

2. METHOD

Figure 1 illustrates our proposed TextSETTR architecture. At a high level, our approach follows Lample et al. (2019), who train a denoising auto-encoder conditioned on a fixed-width style vector. The key difference in our case is that the true style is unknown at training time. To overcome this, we jointly train a "style extractor" component to induce a useful style representation (that can aid in reconstruction) from text in the nearby context. We describe this in more detail below.
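Concretely, each training example pairs a corrupted target sentence with its uncorrupted preceding sentence as the style input. The following is a minimal sketch of this example construction, using simple token dropout as a stand-in for the paper's corruption procedures (the function name, field names, and `drop_rate` parameter are ours):

```python
import random

def make_training_example(context, target, drop_rate=0.2, seed=None):
    """Build one label-free training example: the model must reconstruct
    `target` from a corrupted copy, conditioned on the style extracted
    from the adjacent `context` sentence. Token dropout stands in here
    for the corruption schemes used in the paper (illustrative sketch)."""
    rng = random.Random(seed)
    tokens = target.split()
    # Randomly drop tokens so the decoder must restore them.
    corrupted = [t for t in tokens if rng.random() >= drop_rate]
    return {
        "style_input": context,          # fed to the style extractor
        "encoder_input": " ".join(corrupted),  # corrupted target
        "decoder_target": target,        # reconstruction objective
    }
```

Because context and target are drawn from adjacent positions in unlabeled text, no style labels are ever required.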

2.1. MODEL ARCHITECTURE

We conduct our experiments using a modified version of the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020). Like T5, our model includes a transformer-based encoder and decoder. As in T5 pretraining, the input to the encoder is a corrupted/noised version of the target, resulting in a reconstruction task. Our goal is to design a type of corruption that results in this training task resembling style transfer, despite the lack of labeled training data. Our core addition to T5 is the style extractor. Based on the encoder's architecture, this component's input is an uncorrupted sentence in the same style as the target; relying on our assumption that style is a slow-moving feature, we use the sentence preceding the target (the "context") for this.² This encourages extracting a style representation that is useful for repairing the corrupted input. The only architectural difference between the encoder and style extractor is that we mean-pool the style extractor's hidden state sequence into a single fixed-width vector (the "style vector"); in our experiments, the dimensionality of this vector and the encoder hidden states is 1024. To incorporate the style vector into the rest of the model, we simply add it to each of the final encoder hidden states.
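The conditioning step described above reduces to a mean-pool over the style extractor's states followed by a broadcast addition onto the encoder's states. A minimal numpy sketch (the function name is hypothetical; in the actual model both hidden sizes are 1024):

```python
import numpy as np

def condition_on_style(encoder_states, style_states):
    """Mean-pool the style extractor's hidden states (shape: style_len x d)
    into a single fixed-width style vector, then add it to each final
    encoder hidden state (shape: input_len x d). Sketch of the
    conditioning mechanism described above."""
    style_vector = style_states.mean(axis=0)   # shape: (d,)
    return encoder_states + style_vector       # broadcast over positions
```

The decoder then attends over these style-shifted encoder states as in standard T5.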



¹ Our work is concurrent with Xu et al. (2020), who offer a substantially different approach to label-free style transfer, as discussed in Sections 3 and 5.
² This approach is similar to the use of adjacent sentences for weak supervision in Devlin et al. (2019) and Zhang et al. (2020).



Figure 1: TextSETTR architecture for label-free style transfer. The Encoder, Decoder and Style Extractor (Ex) are transformer stacks initialized from pretrained T5. During training, the model reconstructs a corrupted input, conditioned on a fixed-width "style vector" extracted from the preceding sentence. At inference time, a new style vector is formed via "targeted restyling": adding a directional delta to the extracted style of the input text. Stochastic tuning ranges provide extra conditioning for the decoder, and enable fine-grained control of inference.

