TEACHING OTHERS IS TEACHING YOURSELF REGULARIZATION FOR CONTROLLABLE LANGUAGE MODELS

Abstract

Large-scale pre-trained language models have achieved great success on natural language generation tasks. However, it is difficult to control pre-trained language models so that they generate sentences with an expected attribute such as topic or sentiment. Recent efforts on controllable language generation (Yang & Klein, 2021; Krause et al., 2021; Dathathri et al., 2019) employ an additional attribute classifier to guide the generation of a large-scale pre-trained language model, and have been shown to be efficient for controllable language generation. These methods are named classifier-guided language models (CGLMs). However, we find that the probabilities predicted by the attribute classifiers usually approach 0 or 1, which makes it hard to distinguish sentences with different degrees of match to the expected attribute. We name this the biased probability distribution (BPD) problem. To address the problem, we investigate different methods for adjusting the probability distribution and propose a Teaching Others is Teaching Yourself (TOTY) regularization method to smooth it. Experiments on sentiment control and topic control tasks show that CGLMs achieve better performance with guiding classifiers trained with TOTY.

1. INTRODUCTION

Recently, with the advances in large-scale pre-trained language models (PLMs) (Radford et al., 2017; 2018; 2019; Brown et al., 2020), great progress has been made on natural language generation tasks. With billions or even trillions of parameters and abundant unlabeled training data, PLMs can generate diverse and realistic sentences. Formally, an autoregressive PLM models the probability distribution of a text X = {x_1, x_2, ..., x_T} with the chain rule:

p(X) = \prod_{i=1}^{T} p(x_i | x_1, x_2, ..., x_{i-1}).  (1)

However, these models are usually trained on general-purpose corpora, and the sentences they generate are often inconsistent with task requirements. Therefore, Controllable Language Generation (CLG), which aims to generate sentences that meet such requirements, has become increasingly important in natural language generation. Controllable language generation attempts to model p(X|a), where a is a desired attribute (e.g., topic, length, or sentiment):

p(X|a) = \prod_{i=1}^{T} p(x_i | X_{1:i-1}, a).  (2)

To simplify the expression, we use X_{1:i} to denote the sequence {x_1, x_2, ..., x_i}. It has been found that using an attribute classifier to guide the generation of a PLM is an efficient approach to controlling the PLM so that it generates sentences with the expected attributes (Dathathri et al., 2019; Krause et al., 2021; Yang & Klein, 2021). These methods are called classifier-guided language models (CGLMs). In CGLMs, the conditional probability at each generation step is calculated with the Bayes rule:

p(x_i | X_{1:i-1}, a) ∝ p(a | X_{1:i}) p(x_i | X_{1:i-1}).  (3)
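The guided decoding step in Eq. (3) can be sketched in code: at each position, the language model's next-token distribution is reweighted by the classifier's attribute probability for each candidate continuation, then renormalized. This is a minimal illustrative sketch, not the implementation from any of the cited works; the function name `guided_step` and the strength parameter `alpha` (commonly used to scale classifier influence, but not part of Eq. (3) itself) are our own assumptions.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a vector of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def guided_step(lm_logprobs: np.ndarray,
                clf_logprobs: np.ndarray,
                alpha: float = 1.0) -> np.ndarray:
    """One classifier-guided decoding step, per Eq. (3).

    lm_logprobs:  log p(x_i | X_{1:i-1}) for every token in the vocabulary.
    clf_logprobs: log p(a | X_{1:i}) for each candidate next token, i.e. the
                  classifier score after appending that token to the prefix.
    alpha:        classifier strength (hypothetical knob; alpha=1 gives Eq. (3)).

    Returns the normalized distribution p(x_i | X_{1:i-1}, a).
    """
    # Adding log-probabilities multiplies the probabilities; softmax renormalizes.
    return softmax(lm_logprobs + alpha * clf_logprobs)

# Toy example with a 3-token vocabulary: the classifier strongly prefers
# token 1, so it overtakes token 0 even though the LM ranks token 0 first.
lm = np.log(np.array([0.5, 0.3, 0.2]))
clf = np.log(np.array([0.1, 0.8, 0.1]))
p = guided_step(lm, clf)
```

Note that if the classifier probabilities saturate near 0 or 1 (the BPD problem described in the abstract), `clf_logprobs` becomes nearly binary and dominates the sum, erasing the ranking information in `lm_logprobs`.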

