TEACHING OTHERS IS TEACHING YOURSELF REGULARIZATION FOR CONTROLLABLE LANGUAGE MODELS

Abstract

Large-scale pre-trained language models have achieved great success on natural language generation tasks. However, it is difficult to control pre-trained language models to generate sentences with expected attributes such as topic and sentiment. Recent efforts on controllable language generation (Yang & Klein, 2021; Krause et al., 2021; Dathathri et al., 2019) employ an additional attribute classifier to guide the generation of large-scale pre-trained language models and have been shown to be effective. These methods are named classifier-guided language models (CGLMs). However, we find that the probabilities predicted by the attribute classifiers usually approach 0 or 1, which makes it hard to distinguish sentences with different matching degrees to the expected attribute. We name this the biased probability distribution (BPD) problem. To address the problem, we investigate different methods for adjusting probability distributions and propose a Teaching Others is Teaching Yourself (TOTY) regularization method to smooth the probability distribution. Experiments on sentiment control and topic control tasks show that CGLMs achieve better performance with guiding classifiers trained with TOTY.

1. INTRODUCTION

Recently, with the advances in large-scale pre-trained language models (PLMs) (Radford et al., 2017; 2018; 2019; Brown et al., 2020), great progress has been made on natural language generation tasks. With billions or even trillions of parameters and abundant unlabeled training data, PLMs can generate diverse and realistic sentences. Formally, an autoregressive PLM models the probability distribution of text X = {x_1, x_2, ..., x_T} with the chain rule:

p(X) = ∏_{i=1}^{T} p(x_i | x_1, x_2, ..., x_{i-1}).  (1)

However, these models are usually trained on general-purpose corpora, and the sentences they generate are often inconsistent with task requirements. Therefore, controllable language generation (CLG), which aims to generate sentences that meet such requirements, has become increasingly important in natural language generation. Controllable language generation attempts to model p(X|a), where a is a desired attribute (e.g., topic, length, or sentiment):

p(X|a) = ∏_{i=1}^{T} p(x_i | X_{1:i-1}, a).

To simplify the expression, we use X_{1:i} to denote the sequence {x_1, x_2, ..., x_i}. It has been found that using an attribute classifier to guide the generation of a PLM is an efficient way to control the PLM to generate sentences with expected attributes (Dathathri et al., 2019; Krause et al., 2021; Yang & Klein, 2021). These methods are called classifier-guided language models (CGLMs). In CGLMs, the conditional probability at each generation step is calculated by the Bayes rule:

p(x_i | X_{1:i-1}, a) ∝ p(a | X_{1:i}) p(x_i | X_{1:i-1}).

In the formula above, p(x_i | X_{1:i-1}) is the unconditional probability of generating x_i at step i, which is usually instantiated by a large-scale language model such as GPT. p(a | X_{1:i}) is the probability that the generation result contains the attribute a when starting with X_{1:i}. CGLMs usually use the output of an attribute classifier, known as the guiding classifier, to model p(a | X_{1:i}).
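The Bayes-rule decoding step above can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the cited papers; the function name and the guidance weight λ are assumptions for exposition (λ = 1 recovers the plain product rule):

```python
import torch


def guided_next_token_logits(lm_logits, attr_log_probs, weight=1.0):
    """Combine LM next-token logits with attribute log-probabilities.

    lm_logits:      [vocab] logits for p(x_i | X_{1:i-1}) from the PLM
    attr_log_probs: [vocab] log p(a | X_{1:i}) from the guiding classifier,
                    evaluated for each candidate next token x_i
    weight:         guidance strength (an illustrative lambda, not from the paper)
    """
    # Bayes rule in log space (up to an additive constant):
    # log p(x_i | X_{1:i-1}, a) = log p(x_i | X_{1:i-1}) + lambda * log p(a | X_{1:i})
    return torch.log_softmax(lm_logits, dim=-1) + weight * attr_log_probs
```

Sampling or argmax over these combined logits yields tokens that are both fluent under the PLM and likely to carry the attribute under the classifier.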
However, in experiments, we found that the probability distribution of p(a | X_{1:i}) predicted by guiding classifiers is usually very biased. To be specific, for most sentences X_{1:i}, the probability predicted by guiding classifiers approaches either 0 or 1. We call this phenomenon the biased probability distribution (BPD) problem. In classification tasks, the BPD problem has little influence, since these tasks only need to pick out the class with the highest probability; the concrete value of the probability does not directly affect classification accuracy. However, in CGLMs, where p(x_i | X_{1:i-1}, a) ∝ p(a | X_{1:i}) p(x_i | X_{1:i-1}), the attribute of many sentences can be ambiguous. Especially for autoregressive models, tokens are generated one after another, so the classifier must predict the attribute probability of many incomplete sentences. Assigning a probability approaching 0 or 1 to such sentences is clearly unreasonable. Consider, for example, the sentences listed below Figure 1, whose probabilities of positive sentiment were predicted by a GRU classifier trained on the IMDB dataset (Maas et al., 2011). A good guiding classifier for CGLMs should distinguish sentences with different matching degrees to the expected attribute, meaning that we should smooth the probability distribution predicted by the classifier. The existing methods (Wang et al., 2021; Gupta & Ramdas, 2021; Platt, 2000; Wei et al., 2022; Zadrozny & Elkan, 2001; Szegedy et al., 2016; Müller et al., 2019) for adjusting the probability distribution predicted by classifiers are usually designed to address the mismatch between a model's confidence and its correctness. It will be shown that these methods do not significantly smooth the probability distribution.
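Two of the classical adjustment techniques referenced above admit very short sketches; the snippet below is illustrative only (it is not the TOTY method proposed in this paper): temperature scaling divides the logits by a constant T > 1 before the softmax, and label smoothing (Szegedy et al., 2016) mixes the one-hot training target with a uniform distribution:

```python
import torch
import torch.nn.functional as F


def temperature_scale(logits, T=2.0):
    # T > 1 flattens the predicted distribution; T = 1 leaves it unchanged.
    return F.softmax(logits / T, dim=-1)


def label_smoothing_targets(labels, num_classes, eps=0.1):
    # Replace the one-hot target with (1 - eps) * one_hot + eps / num_classes,
    # so the classifier is trained toward slightly softer probabilities.
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes
```

Both operate on confidence globally; as argued above, neither is designed to separate sentences by how strongly they match the attribute.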



Figure 1: The probability distribution of positive sentiment predicted by different classifiers trained on the IMDB dataset. (a) GRU classifiers trained without/with TOTY regularization. (b) GPT2 classifiers trained without/with TOTY regularization.

a) This tale takes place in
b) This tale takes place in the Namib Desert of Africa.
c) This impressive tale takes place in the Namib Desert of Africa.

The probabilities of positive sentiment predicted for a), b), and c) were 89.5%, 98.5%, and 99.9%, respectively. However, only c) has a clear positive sentiment, while a) and b) do not have a clear sentiment.
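A distribution like the one in Figure 1 can be checked for any classifier by histogramming its predicted attribute probabilities on held-out prefixes. The sketch below is a hypothetical diagnostic, not code from the paper; the bin count and the 0.05 margin are arbitrary illustrative choices:

```python
import numpy as np


def probability_histogram(probs, bins=10):
    """Fraction of predictions falling into each probability bin.

    A classifier suffering from BPD concentrates almost all of its mass
    in the first and last bins (probabilities near 0 or near 1).
    """
    probs = np.asarray(probs, dtype=float)
    counts, edges = np.histogram(probs, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges


def bpd_mass(probs, margin=0.05):
    # Share of predictions with p < margin or p > 1 - margin.
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs < margin) | (probs > 1.0 - margin)))
```

A high `bpd_mass` on ambiguous or incomplete sentences would indicate exactly the failure mode described above.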

