COCON: A SELF-SUPERVISED APPROACH FOR CONTROLLED TEXT GENERATION

Abstract

Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. Given their immense potential, controlling the text generation of such LMs is receiving growing attention. While there are studies that seek to control high-level attributes of generated text (such as sentiment and topic), there is still a lack of more precise control over its content at the word and phrase level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a content input at a fine-grained level. In our self-supervised approach, the CoCon block learns to help the LM complete a partially-observed text sequence by conditioning on content inputs that are withheld from the LM. Through experiments, we show that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner.

1. INTRODUCTION

Transformer-based (Vaswani et al., 2017; Tay et al., 2020) pretrained language models (LMs) have led a wave of new advances in natural language processing, both as a means to extract contextualized word embeddings (Devlin et al., 2018; Dai et al., 2019b; Yang et al., 2019) and as text generators (Radford et al., 2019; Brown et al., 2020). These LMs are trained on huge amounts of text corpora to predict next tokens through a log-likelihood objective. Given their remarkably fluent text generation, there is growing interest in controlling the output texts of such LMs (Keskar et al., 2019; Dathathri et al., 2019). Approaches like training a modified LM from scratch to incorporate target text attributes (Keskar et al., 2019) can be expensive, while finetuning pretrained LMs for specific attributes (Ziegler et al., 2019) limits the scope of text control. Without changing the architecture or weights of pretrained LMs, one promising approach (PPLM) (Dathathri et al., 2019) controls generated text through attribute models. Though effective in controlling high-level text attributes such as topic and sentiment, the same target attribute may yield text samples with vastly different content at the word and phrase levels, leaving a gap for more fine-grained control over the content of LM-generated texts. We conceptualize Content-Conditioner (CoCon) as an approach to narrow this gap by guiding a pretrained LM's text output through the incorporation of a content input. This content input can take the form of a text sequence whose content we would like to condition on for text generation. Essentially, CoCon comprises two parts: 1) a pretrained LM and 2) an interleaved CoCon layer. By employing a pretrained LM, CoCon incorporates the representations of a content input into the encoded text representations through the CoCon layer before passing the content-conditioned representations into the LM's remaining layers (LM_β) for generation.
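The content-conditioning step described above can be sketched as a cross-attention operation in which the prompt's intermediate representations attend over the content input's representations. This is a minimal illustrative sketch, not the paper's exact parameterization: the single attention head, random weights `Wq`/`Wk`/`Wv`, and the omission of causal masking and feed-forward sublayers are all simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cocon_block(h_prompt, h_content, Wq, Wk, Wv):
    """Single-head cross-attention sketch: prompt representations
    attend over [content; prompt], so content features flow into
    the prompt's intermediate representations."""
    q = h_prompt @ Wq                        # (t, d) queries from prompt
    kv_in = np.concatenate([h_content, h_prompt], axis=0)
    k, v = kv_in @ Wk, kv_in @ Wv            # (c+t, d) keys/values
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (t, c+t) attention scores
    return softmax(scores) @ v               # content-conditioned (t, d)

# Toy shapes: d=8 hidden size, t=5 prompt tokens, c=3 content tokens.
rng = np.random.default_rng(0)
d, t, c = 8, 5, 3
h_prompt, h_content = rng.normal(size=(t, d)), rng.normal(size=(c, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
h_cond = cocon_block(h_prompt, h_content, Wq, Wk, Wv)
assert h_cond.shape == (t, d)  # same shape, ready for the LM's later layers
```

Because the output keeps the prompt's shape, the conditioned representations can be passed onward to the rest of the LM without any architectural change, which is what keeps the approach pluggable.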
To train the CoCon block, we propose a self-supervised learning approach whose training data consist of text samples generated by the pretrained LM itself (§ 3.1). By splitting each text sequence into two segments ([x^a; x^b]), CoCon learns through a self-reconstruction objective to help the LM reconstruct the missing latter segment (x^b) by taking x^b itself as the content input. We use content masking for CoCon and also propose other loss functions, such as cycle reconstruction, to condition on content from divergent sources while producing high-quality texts. Since the CoCon block is a small fraction of the LM's size and no finetuning is conducted on the LM's weights, the training cost is significantly lower than that of training an LM from scratch. We show that CoCon's fine-grained content control can be extended to also influence higher-level text attributes such as topic and sentiment in a zero-shot manner, and compare it with strong controlled-generation baselines. Furthermore, CoCon is versatile in assimilating multiple content inputs, and the strength of its content conditioning can be flexibly adjusted through a content bias term during inference. In this paper, we demonstrate the CoCon approach with the GPT-2 345M model (Radford et al., 2019) as the pretrained LM. Given CoCon's modular nature, it can be used with other Transformer-based LMs or even other controlled generation methods. All in all, the core contributions of this paper are:

• We propose CoCon for content-conditioned language generation.
• We introduce a self-supervised learning approach where CoCon learns to complete text sequences when given information about future tokens.
• Through ablation studies and comparisons with strong baselines like PPLM and CTRL (Keskar et al., 2019), we investigate how CoCon controls high-level attributes such as topic and sentiment while generating texts that have high content similarity to the conditioning text.

2. RELATED WORK

There is a line of work that aims to generate output text of desired attributes with neural networks. Some of the earliest efforts involve conditional generative models (Kikuchi et al., 2016; Ficler & Goldberg, 2017), where the networks are trained on text data labeled with the target attributes. These models can be trained via reinforcement learning (Ziegler et al., 2019) or the generative adversarial network (Yu et al., 2017) framework. Unlike CoCon, the requirement of predetermined attributes in those methods limits the possible types of generated texts. CTRL (Keskar et al., 2019) is a recent approach that generates controlled, fluent texts through control codes, which are meta-data prepended to the text during generation. Though it produces high-quality text with its GPT-2-like architecture, its control codes are also predetermined during training. Closest to our work is the Plug and Play Language Model (PPLM) (Dathathri et al., 2019), which seeks to control text from an already pretrained LM, without finetuning, through relatively small 'pluggable' attribute models. While PPLM's flexible design also enables controlled generation without retraining or finetuning the LM, as in CoCon, our approach aims to control the generation at the content level, beyond high-level text attributes. Another core difference lies in the training: CoCon's self-supervised learning absolves the need for labeled data, such as that employed to train PPLM's attribute discriminator models.

Weighted decoding (Ghazvininejad et al., 2017; Holtzman et al., 2018) seeks to control the output text tokens by upweighting the probabilities of targeted words during the decoding step, but has been shown to produce incoherent text (See et al., 2019). Conditional language generation has been used in question generation to enhance faithfulness by attending to textual context such as predicates, subject types or object types (Elsahar et al., 2018), rather than the content input used here in CoCon. Small adapter layers (Bapna et al., 2019) have been previously proposed for multilingual translation, also to save on model size and training resources, but they differ from CoCon's self-supervised training as they rely on annotated sentence pairs of different languages for training. Text style transfer is a related area that controls texts' attributes by translating text from one style to another (Dai et al., 2019a). A few such studies employ auto-encoders to separate texts' style and non-style latent representations (Shen et al., 2017; Hu et al., 2017; Yang et al., 2018). This disentanglement enables style changes to the text in the latent space while retaining most of its content. Another work identifies attribute markers (Li et al., 2018), which are n-grams correlated with a particular style in a text corpus, and edits texts' style by substituting them. Essentially, style transfer alters existing texts rather than generating text, and requires predefined attributes.

3. CONTENT CONDITIONER (COCON)

In the following sections, we discuss the motivation for CoCon, its model architecture, and how we train the CoCon block.

Motivation: In text generation with language models, given the prompt text x_{:t-1} = {x_1, ..., x_{t-1}}, the following text {x_t, ..., x_l} is generated in an auto-regressive manner (Manning et al., 2008; Radford et al., 2019).
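The auto-regressive generation that CoCon builds on (Section 3) samples each token x_t from p(x_t | x_1, ..., x_{t-1}). The loop can be sketched as follows; the `next_logits` function here is a hypothetical stand-in for a real LM (in CoCon, it would be computed from the content-conditioned representations).

```python
import numpy as np

def generate(prompt_ids, next_logits, length, rng):
    """Auto-regressive continuation: each new token is sampled from
    p(x_t | x_1..x_{t-1}), given as unnormalized scores by `next_logits`."""
    ids = list(prompt_ids)
    for _ in range(length):
        logits = next_logits(ids)              # (vocab,) scores for next token
        probs = np.exp(logits - logits.max())  # stable softmax
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids

# Stand-in "LM" over a 4-token vocabulary, for illustration only.
rng = np.random.default_rng(1)
toy_lm = lambda ids: np.array([0.1, 0.2, 0.3, 0.4]) * (len(ids) % 3 + 1)
out = generate([0, 1], toy_lm, length=4, rng=rng)
assert len(out) == 6 and out[:2] == [0, 1]  # prompt kept, 4 tokens appended
```

Weighted decoding, mentioned above, intervenes exactly at the `logits` step by upweighting target words' scores, which is what can push sampling toward incoherent continuations.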

