A DISTRIBUTIONAL APPROACH TO CONTROLLED TEXT GENERATION

Abstract

We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach makes it possible to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM (to our knowledge, the first model with such generality) while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence.

1. INTRODUCTION

Neural language models, such as GPT-2/3 (Radford et al., 2019; Brown et al., 2020a), pretrained on huge amounts of text, have become pre-eminent in NLP, producing texts of unprecedented quality. In this paper, we are concerned with the problem of controlling a generic pretrained LM in order to satisfy certain desiderata. For instance, we may want to avoid toxic content; prevent certain demographic biases; or steer generations towards a certain topic or style. Prior work, taking inspiration from Reinforcement Learning (RL), has aimed at inducing autoregressive models to optimize global objectives using task-specific rewards such as BLEU and ROUGE for Machine Translation and Summarization (Ranzato et al., 2016; Bahdanau et al., 2017), or hand-crafted rewards (Li et al., 2016b; Tambwekar et al., 2019) to improve certain a priori desirable features. However, such an optimization process is not infallible; Liu et al. (2016a) noted that it often leads to "degeneration", producing poor examples that improve the average reward but forgo coherence and fluency. This degeneration is often diagnosed as an effect of deviating too much from the original pretrained LM during optimization. Consequently, prior work has regarded proximity to the pretrained model as a prescription for sample quality. This view is most prominent in open-domain generation where no gold references are available for fine-tuning, making the pretrained LM itself the yardstick for fluency. Jaques et al. (2017); Ziegler et al. (2019) propose a conservative fine-tuning approach moderated by a KL penalty between the trained policy and the original LM, discouraging large deviations. A KL penalty was also used by Dathathri et al. (2020), this time in a plug-and-play rather than a fine-tuning context. However, the authors show that balancing policy deviations from the original LM while also satisfying the control conditions is delicate.
To combat degeneration they had to combine the KL penalty with post-norm fusion, reranking, and early-stopping procedures.

Most of the existing work on Controlled Generation has taken what we refer to as a "pointwise" view, namely focusing on the quality of each individual output, a view that is encouraged by the standard RL goal of maximizing rewards computed at the individual level. Such techniques are incapable of enforcing "distributional" conditions, where some collective statistical properties are desired over the set of all generations. Distributional control is key to solving the problem of social biases in LMs trained on large, uncurated Web corpora. Those LMs, dubbed "Stochastic Parrots" in (Bender et al., 2021), tend to encode hegemonic biases that are harmful to marginalized populations. There has been a large body of work analysing these distributional biases (Blodgett et al., 2020; Stanovsky et al., 2019; Prates et al., 2020; Sheng et al., 2019a; Brown et al., 2020b). However, applying distributional control on pretrained models is still an understudied problem. Sheng et al. (2020) introduce a method relying on adversarial triggers (Wallace et al., 2019); this method does not de-bias the whole distribution but only obtains non-biased continuations of given prompts. Bordia & Bowman (2019) introduce a regularization term for reducing gender bias when training a language model from scratch (as opposed to de-biasing a pretrained model).

In this work, we present our Generation with Distributional Control (GDC) approach, in which we formalize the problem of controlled text generation as a constraint satisfaction problem over the probability distribution $p$ representing the desired target LM.
Namely, we require the expectations ("moments") relative to $p$ of certain output features to have specific values; this makes it possible, for instance, to require all outputs to speak about sports (a pointwise constraint), and 50% of them to mention female characters (a distributional constraint). Additionally, we require $p$ to have a minimal KL divergence $D_{\mathrm{KL}}(p, a)$ from the original pretrained LM $a$. This has the effect that $p$ now inherits favorable linguistic qualities from $a$. As we will explain, this formulation is a generalization of the Maximum Entropy Principle and leads to a unique solution $P(x)$. $P(x)$ is an unnormalized distribution, aka an Energy-Based Model (EBM) (Hinton, 2002; LeCun et al., 2006; Bakhtin et al., 2020), of which $p(x) = \frac{1}{Z} P(x)$ is the normalized version, where $Z \doteq \sum_x P(x)$ is the partition function of $P$.

Computing the EBM representation $P$ is a crucial step, as it fully determines the optimal distribution $p$ we are looking for. However, it is not the end of the story, because the representation thus obtained does not enable us to directly sample from $p$, an essential property of any LM. To this end, we introduce KL-adaptive DPG (Distributional Policy Gradient), a variant of an algorithm recently proposed in (Parshakova et al., 2019b). We train the policy $\pi_\theta$ to approximate $p$ in an adaptive way, by speeding up the next round of approximations based on approximations previously obtained. At the end of this process, we obtain a final $\pi_\theta$, our target LM, on which we can estimate diverse metrics, including $D_{\mathrm{KL}}(p, \pi_\theta)$, measuring the approximation quality of $\pi_\theta$ relative to the optimal $p$, and $D_{\mathrm{KL}}(\pi_\theta, a)$, measuring the divergence of $\pi_\theta$ relative to the original LM $a$.

This two-step approach differs from much research in NLP-oriented work with EBMs, which tends to use EBM representations inside the training loops of neural networks, blurring different dimensions of the problem. By contrast, similarly to Parshakova et al.
(2019a; b) in a different context, we clearly decouple the relatively simple problem of determining a "pivot" optimal EBM from the more difficult problem of exploiting this EBM at inference time. Such decoupling is valuable, because it makes it easier to diagnose the important challenges to focus on.

Figure 1: The Generalized MaxEnt specification (left panel) is looking for a distribution $p$ that lies on the moment constraints manifold $C$ and that minimizes the forward KL $D_{\mathrm{KL}}(p, a)$. The solution is provided by Information Geometry: (1) build the exponential family $E$ determined by $a$ and $\phi$, (2) $p$ lies at the intersection between $C$ and $E$, (3) for any distribution $c$ satisfying the constraints, the "Pythagorean identity" holds: $D_{\mathrm{KL}}(c\|a) = D_{\mathrm{KL}}(c\|p) + D_{\mathrm{KL}}(p\|a)$; in particular $p$ is unique.

Overall, our contributions can be summarized as follows:

1. We introduce a Distributional View for controlled text generation formalized as a constraint satisfaction problem combined with a divergence minimization objective, providing a single framework both for "distributional" constraints (collective statistical requirements) and for "pointwise" constraints (hard requirements on each individual) (§2.1). To our knowledge, this is the first framework with such generality for controlled text generation.

2. We show how these constraints lead to an optimal EBM for the target model (§2.2), propose the KL-Adaptive DPG algorithm for approximating the optimal EBM distribution by an autoregressive policy (§2.3), and show the effectiveness of this adaptive technique for obtaining faster convergence (§B.2).

3. We conduct experiments in a number of pointwise and distributional conditions, assessing results in terms of divergence from GPT-2, fluency and diversity, with better performance than strong baselines.
The distributional experiments show the potential of our approach as a remedy to the current and important problem of bias in pretrained language models, providing a novel direction for addressing it ( §3).

2. FORMALIZATION

We denote by $X$ the set of all sequences $x$ of bounded length $L_{max}$, by $a$ the initial pretrained model and by $p$ the desired target model. The probabilities of $x$ according to each model are $a(x)$ and $p(x)$. Our approach consists in expressing our desiderata through constraints on the desired values $\bar{\mu}_i$ of the expectations (aka moments) $\mu_i \doteq E_{x \sim p}\, \phi_i(x)$ of certain predefined real-valued feature functions $\phi_i(x)$, for $i \in \{1, \ldots, k\}$.

To illustrate, the previous example can be expressed by using two binary features, $\phi_1(x) = 1$ iff $x$ is classified as speaking about sports, $\phi_2(x) = 1$ iff $x$ mentions a female character. Then our "moment constraints" take the following form:

$$\mu_1 = E_{x \sim p}\, \phi_1(x) = 1.0, \qquad \mu_2 = E_{x \sim p}\, \phi_2(x) = 0.5.$$

The first (pointwise) constraint implies that each individual $x$ has to speak about sports (otherwise $\mu_1$ could not reach its maximum value 1.0), the second (distributional) constraint that 50% of the $x$'s have to mention a female character.

Let $C$ be the set of all distributions $c$ over $X$ that satisfy the moment constraints. We then propose to specify $p$ as a distribution respecting the constraints, but also minimizing KL divergence from $a$:

$$p \doteq \arg\min_{c \in C} D_{\mathrm{KL}}(c, a). \tag{1}$$

Equation (1) is a generalization of the Maximum Entropy Principle of Jaynes (1957), which corresponds to the limit case where $a$ is the uniform distribution $u$ over $X$, noting that minimizing $D_{\mathrm{KL}}(c, u)$ is equivalent to maximizing the entropy of $c$ under the constraints; in other words, trying to find the least "specific" distribution satisfying the constraints.
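To make the two example constraints concrete, the following minimal sketch computes the moments of a hand-made candidate distribution over a tiny set of sequences and checks whether it belongs to $C$. The sequences, their probabilities, and the keyword-based features `phi_sports` and `phi_female` are illustrative assumptions only; in the setting described above, the features would be classifiers or detectors applied to LM outputs.

```python
from typing import Callable, Dict


def moment(dist: Dict[str, float], phi: Callable[[str], float]) -> float:
    """Expectation of phi under a distribution given as {sequence: probability}."""
    return sum(prob * phi(x) for x, prob in dist.items())


# Hypothetical stand-ins for the two binary features of the running example.
def phi_sports(x: str) -> float:
    return 1.0 if "match" in x.split() else 0.0


def phi_female(x: str) -> float:
    return 1.0 if "she" in x.split() else 0.0


# A made-up candidate distribution c over three sequences.
c = {
    "she won the match": 0.5,
    "he won the match": 0.5,
    "he cooked dinner": 0.0,
}

mu_1 = moment(c, phi_sports)  # pointwise constraint, target 1.0
mu_2 = moment(c, phi_female)  # distributional constraint, target 0.5
in_C = (mu_1 == 1.0) and (mu_2 == 0.5)  # does c satisfy both moment constraints?
```

Note that the pointwise target of 1.0 can only be met by putting zero mass on every sequence that violates the feature, which is why the third sequence has probability 0 here.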

2.1. CONSTRAINTS, INFORMATION GEOMETRY, EXPONENTIAL FAMILIES

To recap our formal approach, we have a finite set $X$, a distribution $a$ over $X$ s.t. $a(x) > 0, \forall x \in X$, and real functions $\phi_1, \ldots, \phi_k$ over $X$. We specify moment constraints $\mu_i = \bar{\mu}_i$ on distributions $c$ over $X$, where $\mu_i \doteq E_{x \sim c}\, \phi_i(x)$ and the $\bar{\mu}_i$'s are given targets; the set of distributions satisfying these constraints is denoted by $C$. Our Problem is to find a $p$ such that $p = \arg\min_{c \in C} D_{\mathrm{KL}}(c, a)$.

We follow Csiszár & Shields (2004) on this question, a problem that is at the core of the field of Information Geometry (Nielsen, 2018; Amari & Nagaoka, 2000). Under the assumption that $C \neq \emptyset$, they prove the following result (also see §A.1):

Theorem 1

(A) There exists a unique solution $p$ to the problem above, obtained as $p(x) \propto P(x)$ where $P$ is in exponential family form:

$$P(x) = a(x)\, \mathbb{1}[x \in X_C]\, e^{\sum_i \lambda_i \phi_i(x)}. \tag{2}$$

In other words $p(x) = \frac{1}{Z} P(x)$, with $Z = \sum_{x \in X} P(x)$; $P$ is an unnormalized distribution, i.e. an EBM. Here $X_C = \{x \in X \mid \exists c \in C \text{ s.t. } c(x) > 0\}$ is the "support set" associated with $C$. The $\lambda_i$'s are real numbers called the natural parameters associated with the moments $\mu_i$.

(B) $p$ can be approximated to arbitrary precision by distributions $p_\epsilon$ of the form:

$$p_\epsilon(x) \propto a(x)\, e^{\sum_i \lambda_{\epsilon,i} \phi_i(x)} \tag{3}$$

for appropriate real values of the $\lambda_{\epsilon,i}$.

(C) $p$ satisfies the Pythagorean Identity: $D_{\mathrm{KL}}(c, a) = D_{\mathrm{KL}}(c, p) + D_{\mathrm{KL}}(p, a), \ \forall c \in C$ (see Fig. 1).

The advantage of this version of the connection between Generalized Maximum Entropy and Exponential Families is its generality, which distinguishes it from other presentations, and which makes it ideal for unified application to pointwise, distributional or hybrid constraints.

In the special case of only pointwise constraints, of the form $E_{x \sim c}\, \phi_i(x) = 1.0$, $i \in [1, k]$, with $\phi_i(x) \in \{0, 1\}$, let us define the predicate $b(x)$ to be 1 iff $x$ satisfies all the constraints. Then, using the (A) form of the result, it is an easy exercise (see §A.2) to prove that $X_C = \{x \in X \mid b(x) = 1\}$ and that one has $p(x) \propto a(x) b(x)$.
In this case P (x) = a(x)b(x) is a very simple EBM that does not involve an exponential part; this is the EBM form that we use for experiments involving only pointwise constraints. In the general case where some constraints are distributional, the determination of X C is not as direct, and we prefer to use the approximation provided by (B), which permits a generic implementation. With only distributional constraints, an exact solution is typically obtained with finite λ's. With hybrid constraints, some of the λ's may tend to infinite (positive or negative) values but thresholding them suffices to get a good approximation.
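The pointwise special case and part (C) of Theorem 1 can be checked numerically on a toy example. The sketch below is only an illustration under assumed values: the four-element space, the probabilities of `a`, the predicate `b`, and the competing distribution `c` are all made up. It builds $p(x) \propto a(x) b(x)$ and verifies the Pythagorean identity for one distribution `c` that satisfies the constraint.

```python
import math
from typing import Dict


def kl(c: Dict[str, float], q: Dict[str, float]) -> float:
    """D_KL(c || q) over a finite set, with the convention 0 * log 0 = 0."""
    return sum(cx * math.log(cx / q[x]) for x, cx in c.items() if cx > 0)


# Made-up base distribution a over a four-element space.
a = {"x1": 0.4, "x2": 0.3, "x3": 0.2, "x4": 0.1}

# Hypothetical pointwise predicate: only x2 and x3 satisfy the constraint.
b = {"x1": 0.0, "x2": 1.0, "x3": 1.0, "x4": 0.0}

# EBM for the pointwise case, P(x) = a(x) b(x), and its normalised version p.
P = {x: a[x] * b[x] for x in a}
Z = sum(P.values())
p = {x: P[x] / Z for x in P}

# Any other distribution supported on {x : b(x) = 1} also lies in C.
c = {"x1": 0.0, "x2": 0.9, "x3": 0.1, "x4": 0.0}

lhs = kl(c, a)
rhs = kl(c, p) + kl(p, a)  # Pythagorean identity: lhs should equal rhs
```

In this pointwise setting $D_{\mathrm{KL}}(p, a) = -\log Z$, so the identity says that every constraint-satisfying `c` is further from `a` than `p` is, by exactly $D_{\mathrm{KL}}(c, p)$.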

2.2. FROM MOMENT CONSTRAINTS TO EBM

Algorithm 1 Computing λ
Input: $a$, features $\phi$, imposed moments $\bar{\mu}$
1: sample a batch $x_1, \ldots, x_N$ from $a$
2: for each $j \in [1, N]$: $w_j(\lambda) \leftarrow e^{\lambda \cdot \phi(x_j)}$
3: $\hat{\mu}(\lambda) \leftarrow \frac{\sum_{j=1}^{N} w_j(\lambda)\, \phi(x_j)}{\sum_{j=1}^{N} w_j(\lambda)}$
4: solve by SGD: $\min_\lambda \|\bar{\mu} - \hat{\mu}(\lambda)\|_2^2$
Output: parameter vector $\lambda$

Let's now consider a set of desired moment constraints $\bar{\mu}$. In the general case (i.e., when some constraints are distributional), we use Theorem 1.(B), which says that the desired energy-based model $P$ can be approximated arbitrarily closely in the following form:

$$P(x) \doteq a(x)\, e^{\lambda \cdot \phi(x)}.$$

This EBM defines the desired normalized distribution $p(x) \doteq \frac{P(x)}{Z}$, where $Z \doteq \sum_x P(x)$. What is left is to learn appropriate values for the parameter vector $\lambda$ s.t.:

$$E_{x \sim p}\, \phi(x) \simeq \bar{\mu}. \tag{5}$$

We address this problem through Algorithm 1. First, we sample a large number $N$ of sequences $x_1 \ldots x_j \ldots x_N$ from $a$. On line 2, we define "importance weights" $w_j(\lambda) \doteq \frac{P(x_j)}{a(x_j)} = \exp\langle \lambda, \phi(x_j) \rangle$. On line 3, we then use SNIS (Self Normalized Importance Sampling) (Kim & Bengio, 2016; Parshakova et al., 2019a) to estimate $\mu(\lambda) \doteq E_{x \sim p}\, \phi(x)$. SNIS consists in computing:

$$\hat{\mu}(\lambda) = \frac{\sum_{j=1}^{N} w_j(\lambda)\, \phi(x_j)}{\sum_{j=1}^{N} w_j(\lambda)},$$

and it can be shown that $\hat{\mu}(\lambda) \simeq \mu(\lambda)$, with convergence in the limit (Owen, 2013). Note that the estimate $\hat{\mu}(\lambda)$ is obtained not as a single number, but as a parametric function of the variable $\lambda$. We want to find $\lambda$ such that $\hat{\mu}(\lambda) = \bar{\mu}$, a question that we handle on line 4 by performing an SGD optimization over the objective $\min \|\bar{\mu} - \hat{\mu}(\lambda)\|_2^2$ (footnote 6).

At the end of this process, we obtain an estimated value for the parameter vector $\lambda$, and a representation $P(x) = a(x) \exp\langle \lambda, \phi(x) \rangle$. While $a(x)$ is a normalized distribution by construction, the introduction of the second factor loses this normalization property, making $P(x)$ an EBM.
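The steps of Algorithm 1 can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors stand in for a batch already sampled from $a$, plain full-batch gradient descent replaces SGD, and the function names, learning rate, and step count are hypothetical choices. The gradient uses the fact that the derivative of the SNIS estimate with respect to $\lambda$ is the weighted covariance of the features.

```python
import math
from typing import List, Sequence, Tuple


def snis_moments(lam: Sequence[float],
                 feats: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """SNIS estimate of E_p[phi] from feature vectors of samples drawn from a.

    Returns the estimated moments and the normalised importance weights.
    """
    logits = [sum(l * f for l, f in zip(lam, phi)) for phi in feats]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    wn = [v / total for v in w]
    mu_hat = [sum(wj * phi[i] for wj, phi in zip(wn, feats)) for i in range(len(lam))]
    return mu_hat, wn


def fit_lambda(feats: Sequence[Sequence[float]], target: Sequence[float],
               lr: float = 5.0, steps: int = 500) -> List[float]:
    """Minimise ||target - mu_hat(lambda)||^2 by full-batch gradient descent."""
    k = len(target)
    lam = [0.0] * k
    for _ in range(steps):
        mu_hat, wn = snis_moments(lam, feats)
        resid = [t - m for t, m in zip(target, mu_hat)]
        grad = []
        for j in range(k):
            g = 0.0
            for i in range(k):
                # d mu_hat_i / d lambda_j is the weighted covariance of phi_i and phi_j
                cov = sum(w * (phi[i] - mu_hat[i]) * (phi[j] - mu_hat[j])
                          for w, phi in zip(wn, feats))
                g += -2.0 * resid[i] * cov
            grad.append(g)
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return lam


# Pretend batch of N = 1000 samples from a: 7.4% of them have the binary feature on.
feats = [[1.0]] * 74 + [[0.0]] * 926
lam = fit_lambda(feats, target=[0.5])
mu_fit, _ = snis_moments(lam, feats)
```

With a single binary feature the fitted value can be compared against the closed form $\lambda = \log\frac{\bar{\mu}(1-\mu_a)}{(1-\bar{\mu})\mu_a}$, where $\mu_a$ is the feature frequency in the batch; with several interacting features no such closed form is available and the iterative fit is needed.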
Footnote 6. $\mu(\lambda)$ can approximate $\bar{\mu}$ arbitrarily closely, and we know from SNIS theory that with increasing $N$, $\hat{\mu}(\lambda)$ will become arbitrarily close to $\mu(\lambda)$. In our experiments we stop the SGD optimization when $\|\bar{\mu} - \hat{\mu}(\lambda)\|_2^2$ becomes smaller than 0.01.

Footnote 7 (on the term EBM). The class of Energy-Based Models (EBMs) (LeCun et al., 2006) is much larger than the exponential family models we are considering in this paper. An EBM $P(x)$ is just any unnormalized distribution over an input space $X$, in other words a mapping $P$ from $X$ to the non-negative reals. The terminology comes from physics, and corresponds to writing $P(x)$ in the form $P(x) = e^{-E(x)}$, $E$ being called the "energy" associated with $x$.

Footnote 8. A question was raised by an anonymous reviewer about the viability of adding new constraints incrementally. The answer is yes; more details are provided in the Appendix, §A.3.

2.3. FROM EBM TO AUTOREGRESSIVE POLICY

Algorithm 2 KL-Adaptive DPG
Input: $P$, initial policy $q$
1: $\pi_\theta \leftarrow q$
2: for each iteration do
3:     for each episode do
4:         sample $x$ from $q(\cdot)$
5:         $\theta \leftarrow \theta + \alpha^{(\theta)} \frac{P(x)}{q(x)} \nabla_\theta \log \pi_\theta(x)$
6:     if $D_{\mathrm{KL}}(p\|\pi_\theta) < D_{\mathrm{KL}}(p\|q)$ then
7:         $q \leftarrow \pi_\theta$
Output: $\pi_\theta$

The EBM representation just obtained for $P$ defines the optimal $p = Z^{-1} P$ unambiguously, a crucial intermediate step in the solution of our problem. From it we can immediately compute ratios of the form $p(x)/p(x')$ for two sequences $x, x'$, but without knowing $Z$, we cannot compute $p(x)$ and, even with such knowledge, we cannot produce samples from $p$.

This problem is typical of EBMs at large: they provide a rich and flexible mechanism for specifying models, but they leave a gap between representation and exploitation. A range of techniques, from sophisticated MCMC approaches (especially for continuous models in vision) to contrastive learning techniques, have been developed for bridging this gap.

One technique that is suitable for our objective here, namely sampling from a sequential EBM that includes an autoregressive component $a(x)$, is the DPG ("Distributional Policy Gradient") algorithm (Parshakova et al., 2019b). The objective of DPG is to obtain an autoregressive policy $\pi_\theta$ that approximates $p$, where approximation is formalized in terms of making the cross-entropy $CE(p, \pi_\theta) = -\sum_x p(x) \log \pi_\theta(x)$ as small as possible (footnote 9). DPG exploits the fact that, for any "proposal" distribution $q$ whose support contains the support of $p$, we have

$$\nabla_\theta CE(p, \pi_\theta) = -\nabla_\theta E_{x \sim p} \log \pi_\theta(x) = -E_{x \sim p} \nabla_\theta \log \pi_\theta(x) = -E_{x \sim q} \frac{p(x)}{q(x)} \nabla_\theta \log \pi_\theta(x),$$

where the last equality is an instance of importance sampling.

Our "KL-adaptive" version of DPG is shown in Algorithm 2. We start from an input EBM $P$, along with an initial policy $q$ which is a proxy to $p$; in our case we take $q = a$. During an iteration (think minibatch or set of minibatches), we sample a number of sequences from $q$, do an SGD update of $\theta$ (line 5), where $P$ is used instead of $p$ (noting that they only differ by a multiplicative constant), and where $\alpha^{(\theta)}$ is a learning rate. The efficiency of the algorithm is related to how close the proposal $q$ is to the target $p$ (footnote 10). The algorithm is adaptive in the sense that it modifies $q$ periodically to take advantage of the evolving approximations $\pi_\theta$. On line 6, we test whether the current $\pi_\theta$ is closer than $q$ to $p$ in terms of KL-divergence, and if so we update $q$ to $\pi_\theta$ on line 7 (footnote 11). §B.2 provides an ablation study showing the effectiveness of this adaptive step for obtaining faster convergence.

Footnote 9. This is equivalent to minimizing $D_{\mathrm{KL}}(p, \pi_\theta) = CE(p, \pi_\theta) - H(p)$.

Footnote 10. In the limit where $q$ were equal to $p$, the algorithm would be identical to standard supervised training, except that samples would be obtained directly from the underlying process $p$ rather than a training set of samples.
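The control flow of Algorithm 2 can be illustrated on a toy problem where the "policy" is a softmax over a four-element space rather than an autoregressive LM. Everything specific in this sketch is an assumption made for illustration: the space, the values of `a` and `P`, the hyper-parameters, and the function names. In particular, the KL test on line 6 is computed exactly by enumeration, which is only possible because the space is tiny; the paper instead estimates it by importance sampling (§B.1).

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(theta: Sequence[float]) -> List[float]:
    top = max(theta)
    e = [math.exp(t - top) for t in theta]
    s = sum(e)
    return [v / s for v in e]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) with the convention 0 * log 0 = 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def kl_adaptive_dpg(P: Sequence[float], a: Sequence[float], iterations: int = 40,
                    episodes: int = 50, lr: float = 0.1,
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    rng = random.Random(seed)
    Z = sum(P)
    p = [v / Z for v in P]            # exact target, available only in this toy setting
    theta = [math.log(v) for v in a]  # line 1: pi_theta starts as the initial policy
    q = list(a)
    for _ in range(iterations):       # line 2
        for _ in range(episodes):     # line 3
            x = rng.choices(range(len(q)), weights=q)[0]  # line 4: sample x from q
            pi = softmax(theta)
            weight = P[x] / q[x]
            for i in range(len(theta)):
                # gradient of log softmax: indicator(i == x) - pi_i
                grad_log = (1.0 if i == x else 0.0) - pi[i]
                theta[i] += lr * weight * grad_log        # line 5
        pi = softmax(theta)
        if kl(p, pi) < kl(p, q):      # line 6 (exact here, estimated in the paper)
            q = pi                    # line 7: adopt the policy as the new proposal
    return softmax(theta), p


a = [0.4, 0.3, 0.2, 0.1]
P = [0.0, 0.3, 0.2, 0.0]  # pointwise EBM a(x) b(x), with b selecting the two middle items
pi_final, p_target = kl_adaptive_dpg(P, a)
```

Samples with $P(x) = 0$ contribute a zero update, so early on, when $q = a$ puts half of its mass outside the support of $p$, half of the episodes are wasted; replacing $q$ by an improved $\pi_\theta$ reduces this waste, which is the intuition behind the adaptive step.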

3. EXPERIMENTS, RESULTS, AND EVALUATION

In this section we describe our evaluation methodology and perform experiments on pointwise constraints ( §3.2) and on distributional and hybrid constraints ( §3.3). The Appendix contains a detailed view of evaluation ( §H), comparison with extra baselines ( §D.2), and an ablation study ( §B.2).

3.1. EVALUATION METRICS

The main metrics we report are: (1) $E_{x \sim \pi_\theta}\, \phi_i(x)$, assessing the ability of $\pi_\theta$ to reach the expectation goal on the $i$-th constraint, (2) $D_{\mathrm{KL}}(p\|\pi_\theta)$, the forward KL divergence from the optimal distribution (which should be as close to 0 as possible), (3) $D_{\mathrm{KL}}(\pi_\theta\|a)$, the reverse KL divergence from the original GPT-2; for details on the estimation of these metrics see §B.1.

Previous work has mostly focused on the diversity of each individual output using Dist-1,2,3 scores (Li et al., 2016a) to measure repetitions within a single generated sequence. However, the shortcomings of optimization techniques in terms of sample diversity, when training generative models for text, have recently been documented in (Caccia et al., 2020). So additionally, we report Self-BLEU-3,4,5 (Zhu et al., 2018) to measure repetitions at a distributional level across the whole set of generated samples, and also provide a token/type frequency analysis (see Fig. 4 and §H.4). Note that KL divergence from the original GPT-2 also implicitly captures sample diversity: a distribution that focuses all its probability mass on a few sequences typically displays high divergence from GPT-2. Implementation details and hyper-parameters are available in the Appendix (§F).
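As a rough illustration of a within-sequence diversity score, the sketch below computes a Dist-n value as the ratio of distinct n-grams to the total number of n-grams in one tokenised sequence. This is one common formulation and is an assumption here; the exact normalisation and tokenisation used in the experiments may differ, and `dist_n` is a hypothetical helper name.

```python
from typing import List, Sequence


def dist_n(tokens: Sequence[str], n: int) -> float:
    """Share of distinct n-grams among all n-grams of a single sequence."""
    ngrams: List[tuple] = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n
    return len(set(ngrams)) / len(ngrams)


repetitive = "the cat the cat the cat".split()
varied = "the cat sat on a mat".split()

d1_rep = dist_n(repetitive, 1)  # 2 distinct unigrams out of 6
d1_var = dist_n(varied, 1)      # 6 distinct unigrams out of 6
d2_rep = dist_n(repetitive, 2)  # 2 distinct bigrams out of 5
```

A score of this kind only detects repetition inside one output; two samples that are each internally varied but identical to one another would both score well, which is why a corpus-level measure such as Self-BLEU is reported as well.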

3.2. POINTWISE CONSTRAINTS EXPERIMENTS

Pointwise constraints are of the form $E_p\, \phi_i(x) = 1$, with $\phi_i$ a binary feature. Contrary to distributional constraints, they can be directly associated with a "reward", namely $\phi_i$ itself. RL-inspired baselines can then be introduced naturally, and this is what we do here.

Single-Word constraints: Here we constrain the presence of a specific word $w$ in the generated text, i.e. $\phi(x) = 1$ iff $w$ appears in the sequence $x$. We use 9 single-word constraints of different rarity levels: "US" (original frequency: $7 \cdot 10^{-3}$), "China" ($4 \cdot 10^{-3}$), "Canada" ($2 \cdot 10^{-3}$), "amazing" ($1 \cdot 10^{-3}$), "Paris" ($5 \cdot 10^{-4}$), "restaurant" ($6 \cdot 10^{-4}$), "amusing" ($6 \cdot 10^{-5}$), "Vampire" ($9 \cdot 10^{-5}$), "Wikileaks" ($8 \cdot 10^{-5}$).

Word-list constraints: We use 4 different word lists among those proposed in (Dathathri et al., 2020), covering the following topics: "kitchen", "fantasy", "politics", and "computers". We set $\phi_l(x) = 1$ if $x$ contains at least one word from the word list $l$.

Classifier-based constraints: We use pre-trained classifiers from (Dathathri et al., 2020), which consist of a linear head on top of GPT-2. We select 4 classes and define corresponding pointwise constraints: "very positive", "positive", "very negative" and "Clickbait". See §F for details on constraint computations.

Baselines: We compare our method GDC to three baselines: (1) REINFORCE (Williams, 1992b), using the reward $\phi(x)$, i.e. trying to maximize $E_{\pi_\theta}\, \phi(x)$; (2) REINFORCE_P(x): Reinforce again, but now using the reward $P(x)$ based on our energy model $P$, i.e. maximizing $E_{\pi_\theta}\, P(x)$; this baseline starts from the same optimal EBM $P$ representation as GDC but with a standard optimization objective rather than a distributional one; in other words, while GDC tries to get a similar sampling distribution to $p$, this baseline tries to get sequences of maximal probability $p(x)$.
(3) ZIEGLER (Ziegler et al., 2019): an approach relying on the RL Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) and which tries to maximize the objective $E_{\pi_\theta}\, \phi(x) - \beta D_{\mathrm{KL}}(\pi_\theta, a)$, which interpolates the reward $\phi(x)$ with a KL-divergence penalty from the pretrained model, but where the goal is not explicitly to satisfy a constraint; for a geometric illustration of the differences with

Footnote 11 (to line 6 of Algorithm 2). In the original DPG, the superiority test is done on the basis of the log-likelihood on a validation set. Here we are in the more demanding situation where no validation set is available. To directly estimate the KL divergence from $p$ (line 6), we exploit the identity $D_{\mathrm{KL}}(p\|\pi) = -\log Z + \frac{1}{Z} E_{x \sim q(x)} \frac{P(x)}{q(x)} \log \frac{P(x)}{\pi(x)}$. See §B.1 for derivations and a comparison with using Total Variation Distance (TVD) for assessing divergence.

In the case of ZIEGLER we can see a positive effect of the interpolation factor $\beta$ between the reward and the KL penalty in the objective function. In the aggregated experiments reported here, the reward is slightly better than with GDC, but with inferior diversity scores (see also Fig. 4, showing that GDC produces richer vocabulary), and the stability is much worse (a detailed view of each experiment is provided in §H, showing more clearly the instability of this baseline).

A complementary evaluation is provided by Figure 3, focusing on the ability of $\pi_\theta$ to converge to the optimal distribution $p$. We see that GDC is superior to all baselines in terms of $D_{\mathrm{KL}}(p\|\pi_\theta)$ and also much more stable.

In summary, in these experiments, we see that with GDC the constraint expectation $E_{\pi_\theta}\, \phi(x)$ smoothly increases while $\pi_\theta$ maintains the lowest divergence from GPT-2, becomes closest to the optimal $p$, and has the best diversity scores overall.
On the other hand, we also note that at the point where we stop training (30K steps), the average over experiments of E π θ φ(x), while still increasing, does not reach 100%, an issue that we discuss at the end of the paper ( §4).
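The single-word and word-list features, and the constraint satisfaction rate estimated from a set of generations, can be sketched as follows. The regular-expression tokenisation, the case-sensitive matching, the example sentences, and the function names are all illustrative assumptions; the actual constraint computations are described in §F.

```python
import re
from typing import Callable, Iterable, List, Sequence


def tokens(text: str) -> List[str]:
    """Very simple word tokeniser (an assumption, not the tokeniser used with GPT-2)."""
    return re.findall(r"\w+", text)


def phi_word(text: str, word: str) -> float:
    """Single-word feature: 1 iff the word appears as a token of the text."""
    return 1.0 if word in tokens(text) else 0.0


def phi_word_list(text: str, word_list: Iterable[str]) -> float:
    """Word-list feature: 1 iff at least one word of the list appears in the text."""
    words = set(word_list)
    return 1.0 if any(t in words for t in tokens(text)) else 0.0


def constraint_rate(samples: Sequence[str], phi: Callable[[str], float]) -> float:
    """Monte Carlo estimate of E[phi(x)] over a set of generated samples."""
    return sum(phi(s) for s in samples) / len(samples)


samples = [
    "Paris is amazing in spring.",
    "The restaurant was closed.",
    "An amazing goal!",
]
rate = constraint_rate(samples, lambda s: phi_word(s, "amazing"))
```

For a pointwise constraint the target value of this rate is 1.0, so the gap between the measured rate and 1.0 is the share of generations that still violate the constraint.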

3.3. DISTRIBUTIONAL AND HYBRID CONSTRAINTS EXPERIMENTS

As formalized in §2, GDC makes it possible to define pointwise and distributional constraints as well as any mix between them. This unique feature makes it very suitable to remedy biases that the text generation model may have, a problem identified in several previous works (Sheng et al., 2019b). We employ GDC to balance gender and profession distributions across biographies generated by a GPT-2 model fine-tuned on Wikipedia Biographies (Lebret et al., 2016) (henceforth GPT-2 bio) (§G gives additional details). The bias in GPT-2 bio is significant: we calculated that this model generates only around 7% female biographies. It also displays a large imbalance between professions related to "Science" (1.5%), "Art" (10.0%), "Business" (10.9%) and "Sports" (19.5%).

Experiment 1: Single Distributional Constraint. We use the distributional constraint $E_{x \sim p}\, \phi_{female}(x) = 0.5$; GDC is able to reduce the bias of GPT-2 bio to obtain 35.6% female biographies rather than only 7.4% (see Fig. 2 for this experiment and the next ones).

Experiment 2: Multiple Distributional Constraints. We then test our framework with several distributional constraints of different values and control directions. We specify four distributional constraints all at once with the goal of increasing the expectations of "science" and "art" to 40% and decreasing those of "sports" and "business" to 10%. GDC is able to increase the expectations of the first two professions respectively from 1.5% to 20.3% and from 10.0% to 31.6% and to decrease those of "business" and "sports" respectively from 10.9% to 10.2% and from 19.5% to 11.9%, reaching expectations close to the desired ones for all features using a single training method.

Experiments 3,4,5,6: Hybrid Constraints. Here we want to de-bias the model as in the previous case but we single out biographies of scientists, artists, etc.
Formally, our requirements become $E_{x \sim p}\, \phi_{profession}(x) = 1.0$, a pointwise constraint, and $E_{x \sim p}\, \phi_{female}(x) = 0.5$, a distributional constraint. In those 4 hybrid experiments we can clearly see that GDC can address both pointwise and distributional constraints, increasing each simultaneously by just the right amount to reach the desired expectations. Appendix §G further elaborates on Fig. 2 (convergence curves).
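For a single binary feature, the natural parameter of the optimal EBM $P(x) = a(x) e^{\lambda \phi(x)}$ has a closed form, which gives a feel for the size of the correction needed in Experiment 1. The sketch below is a worked example under the assumption that the base rate of the feature under $a$ is exactly the 7.4% reported above; the function names are hypothetical. It describes the optimal $p$, not the trained policy $\pi_\theta$, which reaches 35.6% rather than 50%.

```python
import math


def tilt_lambda(base_rate: float, target: float) -> float:
    """lambda such that a binary feature with E_a[phi] = base_rate reaches
    E_p[phi] = target under p(x) proportional to a(x) * exp(lambda * phi(x))."""
    return math.log(target * (1.0 - base_rate) / ((1.0 - target) * base_rate))


def tilted_moment(base_rate: float, lam: float) -> float:
    """E_p[phi] for the exponentially tilted distribution."""
    up = base_rate * math.exp(lam)
    return up / (up + (1.0 - base_rate))


lam = tilt_lambda(0.074, 0.5)      # base rate and target taken from Experiment 1
check = tilted_moment(0.074, lam)  # recovers the 0.5 target
odds_boost = math.exp(lam)         # factor by which the odds of the feature are multiplied
```

Here $e^{\lambda}$ equals the ratio of target odds to base odds, so sequences with the feature are up-weighted by a factor of about 12.5 relative to the others before normalisation.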

4. DISCUSSION

Our approach to controlled text generation is distinguished by its breadth -the first one to handle distributional along with pointwise constraints, with applications to the important problem of Bias in pretrained LMs -and by the transparency of the supporting formalism. It decouples the training objective along two different dimensions. The first consists in solving the initial constraints specification, and leads through a direct algorithm to an optimal solution in EBM format. The second, where the real computational difficulty lies, consists in approximating this EBM with an autoregressive policy for use at inference time. Sampling from an EBM is an important, hard, and well-identified challenge in the literature. Our approach there consists in proposing a KL-adaptive version of the DPG algorithm, which exploits ascertained improvements of the trained policy to speed up convergence. This is an effective method for rare events, as we show in an ablation study ( §B.2). In the case of pointwise constraints, where comparisons with baselines can be done, our experiments show the Our method does not suffer from degeneration, but our end policies still generate a number of samples not satisfying the constraints. A possibility, left for future work, might consist in filling the moderate residual gap with MCMC techniques, which would be guaranteed to reach our optimal p in the limit. We do not go this route here, but conduct an experiment (see §C) to better understand the nature of the problem. In the simple case of a single-word constraint (x includes "amazing"), we sample directly 1M samples from GPT-2 and keep the roughly 5K samples containing amazing (a variant of rejection sampling, taking two processing days). We then do a standard supervised fine-tuning of GPT-2 with these samples, stopping training when the CE validation loss starts to increase, and observe that this model exhibits a worse constraint satisfaction rate than ours. 
This experiment does not mean that a much larger fine-tuning dataset, obtained in this slow, non-adaptive way, would not reach better statistics, but it raises doubts about the ability of the GPT-2 architecture to fine-tune over such a non-standard constraint as containing a given word somewhere in its output. Overall, we believe that the proposed decomposition into two sub-problems is a methodological advantage compared to most other works, which directly aim at training a policy with the goal of improving certain evaluation metrics, but without clearly defining what qualifies as an optimal solution. The computational challenge of fully bridging the gap between the optimal EBM and an efficient sampling engine remains, and we hope that the formalism we propose, along with initial applications and experimental validations, will motivate further research along these lines.

Appendix

A DETAILS ON FORMALIZATION (§2)

A.1 COMMENTS ON THEOREM 1

Our statement of Theorem 1 is actually a reformulation of two results in section 3 of Csiszár & Shields (2004). Our property (A) is a simple notational transposition of their Remark 3.1 (p. 444). Property (C) is the Pythagorean Identity in their Theorem 3.2 (p. 442). Property (B) reformulates the last part of the same Theorem "... and in general $L \cap \mathrm{cl}(E_Q) = \{P^*\}$" in terms of a limit of a sequence of distributions.

Note: Csiszár & Shields (2004) assume a finite $X$ here, but generalizations to infinite (countable and/or continuous) $X$ spaces are possible, see (Csiszar, 1975).

A.2 THE CASE OF POINTWISE CONSTRAINTS IN §2.2

In the case of purely pointwise constraints, if $b(x) = 1$, then the distribution $c = \delta_x$ is in $C$, hence $x \in X_C$. Conversely, if $x \in X_C$ then there is some $c \in C$ such that $c(x) > 0$, implying that $b(x) = 1$. Hence $X_C = \{x \in X \mid b(x) = 1\}$. Thus, in equation (2), $P(x) = a(x) b(x) \exp \sum_i \lambda_i \phi_i(x)$; but for $b(x) \neq 0$, $\phi_i(x) = 1$ for all $i$, so the exponential factor is a constant, which proves that $P'(x) = a(x) b(x)$ is proportional to $P(x)$, and therefore $p(x) \propto P'(x)$.

A.3 INCREMENTALLY ADDING NEW CONSTRAINTS

An interesting question is whether the process explained in §2 can be made incremental: if one has already computed a $p$ and a $\pi_\theta$ relative to a certain number of constraints, can one add a new constraint without restarting the whole process from scratch? The answer is yes, and here we provide some formal elements to understand why.

A.3.1 TRANSITIVITY PROPERTY OF GENERALIZED MAXENT

According to (Csiszár, 1996), the Generalized MaxEnt of sections §2.1 and §2.2 has the "Transitivity property". In our notation, this says that if we have $k' > k$ constraints, with $C$ the manifold of distributions respecting only the first $k$ constraints, $C'$ the manifold respecting all $k'$ constraints (hence $C' \subset C$), then the maxent projection $p'$ of $a$ onto $C'$ can be obtained by first projecting $a$ onto $C$, obtaining $p$, and then projecting $p$ onto $C'$, obtaining $p'$. In particular, the $k$ lambdas associated with $p$ can be directly reused as the first lambdas of the $k'$ lambdas associated with $p'$.

(Csiszár, 1996) gives only a minimal proof sketch, but it is instructive to provide the details, as we do now, because the proof is a neat illustration of the power of information geometry for problems of the kind we consider. The proof, illustrated in Figure 5, is very similar to one of the proofs for the transitivity of the orthogonal projection in Euclidean geometry.

As $c$ is an arbitrary point of $C'$, this proves that $r$ is the projection of $a$ onto $C'$, in other words, $r = p'$.

A.3.2 TRANSITIVITY AND AUTOREGRESSIVE POLICY

Due to the Transitivity property, when calculating the EBM representation, it is possible to start from $p$ without re-fitting $p'$ from scratch. However, the move from EBM to autoregressive policy of §2.3 remains to be discussed. The question now is the following. We have already obtained a policy $\pi_\theta$ approximating $p$, and we are interested in obtaining a policy $\pi_{\theta'}$ approximating $p'$: is it advantageous to start Algorithm 1 with $q = \pi_\theta$, rather than starting "from scratch" and taking $q = a$? Intuition says "yes, very probably", because $\pi_\theta$ is by construction an approximation to $p$, which is closer than $a$ to $p'$ (formally, $D_{KL}(p', p) \leq D_{KL}(p', a)$, see Fig. 5, where $p' = r$). Due to the approximation, we only have $D_{KL}(p', \pi_\theta) \simeq D_{KL}(p', p)$, so a formal proof that $\pi_\theta$ is superior to $a$ as a starting point is impossible, but we expect that further experiments would confirm the improvement.

B MORE ON ADAPTIVITY

B.1 DETAILS ON KL-ADAPTIVITY

In this section we provide details on the comparison step in our KL-Adaptive version of the DPG Algorithm, introduced in Section 2. We want to assess whether the current $\pi_\theta$ is closer than $q$ to $p$; if the test is positive, we set $\pi_\theta$ as the new proposal, hoping to make the proposal more effective for importance sampling. There are several ways to compute similarity between distributions, two of the most popular being KL divergence and Total Variation Distance (TVD), where $\mathrm{TVD}(p\|p') \doteq \frac{1}{2}\sum_x |p(x) - p'(x)|$, which is often used in probability and MCMC theory. Calculating these metrics relative to $p$ is not straightforward, since the distribution $p \propto P$ is only implicitly represented by the unnormalized EBM $P$, and we cannot easily obtain direct samples from $p$. In this section we describe a workaround.

Given $P$ and a proposal distribution $q$ that we can sample from, using importance sampling (Owen, 2013), one can calculate the partition function $Z$ as follows:

$$Z = \sum_x P(x) = \sum_x q(x)\,\frac{P(x)}{q(x)} = \mathbb{E}_{x \sim q}\,\frac{P(x)}{q(x)} \tag{7}$$

We can then compute $D_{KL}(p\|\pi)$ as:

$$\begin{aligned}
D_{KL}(p\|\pi) &= \sum_x p(x)\log\frac{p(x)}{\pi(x)} = \sum_x p(x)\log\frac{P(x)}{Z\,\pi(x)} \\
&= -\log Z + \sum_x p(x)\log\frac{P(x)}{\pi(x)} = -\log Z + \sum_x q(x)\,\frac{p(x)}{q(x)}\log\frac{P(x)}{\pi(x)} \\
&= -\log Z + \frac{1}{Z}\,\mathbb{E}_{x \sim q}\,\frac{P(x)}{q(x)}\log\frac{P(x)}{\pi(x)}
\end{aligned}$$

Similarly, for $\mathrm{TVD}(p\|\pi)$:

$$\begin{aligned}
\mathrm{TVD}(p\|\pi) &= \frac{1}{2}\sum_x |p(x) - \pi(x)| = \frac{1}{2}\sum_x q(x)\left|\frac{\pi(x)}{q(x)} - \frac{p(x)}{q(x)}\right| \\
&= \frac{1}{2}\sum_x q(x)\left|\frac{\pi(x)}{q(x)} - \frac{P(x)}{Z\,q(x)}\right| = \frac{1}{2}\,\mathbb{E}_{x \sim q}\left|\frac{\pi(x)}{q(x)} - \frac{P(x)}{Z\,q(x)}\right|
\end{aligned}$$

In §B.2 we run an ablation study to compare the use of $D_{KL}$ (on line 6 of Algorithm 2) or its replacement by TVD. For both metrics, we need an estimate of $Z$. The precision of this estimate depends on the sample size and the quality of the proposal distribution $q$. We calculate a moving average estimate $Z_{MA}$ of $Z$ (Algorithm 3, lines 7 and 8), which is used inside the estimations of $D_{KL}(p\|\pi_\theta)$ and $D_{KL}(p\|q)$.
$Z_{MA}$ is updated at each iteration of the training, and the moving average estimate is valid because $\hat{Z}_i$, based on $K$ samples, is an unbiased estimate of $Z$, and therefore so is $Z_{MA}$. In this way, the estimate benefits from all the samples produced during the course of the training; moreover, because the proposal distribution $q$ evolves and gets closer to the target distribution $p$, the quality of the importance-sampling estimates of both $D_{KL}(p\|\pi_\theta)$ and $Z_{MA}$ increases (equation 7). A similar approach is taken in the case of TVD (not shown).

Algorithm 3 KL-Adaptive DPG (detailed)
Input: $P$, initial policy $q$
 1: $\pi_\theta \leftarrow q$
 2: $Z_{MA} \leftarrow 0$    (initialize moving average estimate of $Z$)
 3: for each iteration $i$ do
 4:     for each step $k \in [1, K]$ do
 5:         sample $x_k$ from $q(\cdot)$
 6:         $\theta \leftarrow \theta + \alpha^{(\theta)}\,\frac{P(x_k)}{q(x_k)}\,\nabla_\theta \log \pi_\theta(x_k)$
 7:     $\hat{Z}_i \leftarrow K^{-1}\sum_k P(x_k)/q(x_k)$    (estimate on the $K$ samples)
 8:     $Z_{MA} \leftarrow \frac{i \cdot Z_{MA} + \hat{Z}_i}{i + 1}$    (update moving average estimate of $Z$)
 9:     $\hat{D}_{KL}(p\|\pi_\theta) \leftarrow -\log Z_{MA} + (K\,Z_{MA})^{-1}\sum_k \frac{P(x_k)}{q(x_k)}\log\frac{P(x_k)}{\pi_\theta(x_k)}$    (estimate on the $K$ samples)
10:     $\hat{D}_{KL}(p\|q) \leftarrow -\log Z_{MA} + (K\,Z_{MA})^{-1}\sum_k \frac{P(x_k)}{q(x_k)}\log\frac{P(x_k)}{q(x_k)}$    (estimate on the $K$ samples)
11:     if $\hat{D}_{KL}(p\|\pi_\theta) < \hat{D}_{KL}(p\|q)$ then
12:         $q \leftarrow \pi_\theta$
Output: $\pi_\theta$

B.2 ABLATION ON ADAPTIVITY

Here we run an ablation experiment on the adaptivity step of KL-Adaptive DPG (§2). We compare three variants of our proposed method. DPG-KLD uses KL divergence from the target distribution $p$ to measure the quality of the trained policy $\pi_\theta$, i.e. if $D_{KL}(p\|\pi_\theta) < D_{KL}(p\|q)$ we update the proposal distribution $q \leftarrow \pi_\theta$. DPG-TVD is similar but uses the total variation distance (TVD) instead. In the non-adaptive variant, the initial proposal $q$ is kept fixed during training. We run 3 pointwise experiments with single-word constraints of three rarity levels in the original GPT-2 distribution, namely "Vampire" ($1/10^4$), "Paris" ($1/10^3$), and "US" ($1/10^2$). For each we use 3 different seeds and train for 10k gradient updates.
Figure 6 shows training trends of the three ablations. We find a significant difference in convergence speed in favour of the adaptive methods. The efficiency gap between adaptive and non-adaptive methods grows as the constraints become rarer, i.e. as the initial proposal $q$ starts farther from the target distribution $p$, since the efficiency of the DPG algorithm depends on how close the proposal $q$ is to the target $p$. When $q$ is continuously adapted, the proposal distribution becomes closer to $p$ and training becomes efficient regardless of how far the initial proposal distribution is from $p$. We observe similar convergence rates for DPG-KLD and DPG-TVD.

Figure 6: Ablation experiment showing the effectiveness of the adaptive step in the DPG algorithm explained in Section 2. We compare three adaptivity variants, based on the KL divergence (DPG-KLD), on the TVD distance (DPG-TVD), and with no adaptation. We find similar convergence rates for both KLD and TVD adaptive DPG, compared to a much slower convergence without adaptation.
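The estimators of §B.1 and the loop of Algorithm 3 can be illustrated on a tiny finite space where exact values are available for comparison. The sketch below is not the paper's implementation: it assumes a softmax policy over four hypothetical "sequences", a made-up initial distribution `a`, and a pointwise constraint `b`; the step size, `K`, and the number of iterations are arbitrary choices for the toy problem.

```python
import math
import random


def softmax(theta):
    """Turn a list of logits into a probability vector."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, pi):
    """Exact KL(p || pi) on a finite space (terms with p(x) = 0 contribute 0)."""
    return sum(px * math.log(px / pix) for px, pix in zip(p, pi) if px > 0)


def kl_estimate(P, q, r, xs, z_ma):
    """Importance-sampling estimate of KL(p || r) from samples xs drawn from q."""
    total = sum(P[x] / q[x] * math.log(P[x] / r[x]) for x in xs if P[x] > 0)
    return -math.log(z_ma) + total / (len(xs) * z_ma)


def kl_adaptive_dpg(P, q, iterations=400, K=64, alpha=0.05, seed=0):
    """Toy KL-Adaptive DPG; comments give the matching lines of Algorithm 3."""
    rng = random.Random(seed)
    n = len(P)
    q = list(q)
    theta = [math.log(qx) for qx in q]               # line 1: pi_theta <- q
    z_ma = 0.0                                       # line 2
    for i in range(iterations):                      # line 3
        xs = rng.choices(range(n), weights=q, k=K)   # line 5, K samples at once
        for x in xs:                                 # lines 4 and 6
            pi = softmax(theta)
            w = P[x] / q[x]
            for j in range(n):
                # gradient of log softmax: one-hot(x) minus pi
                theta[j] += alpha * w * ((1.0 if j == x else 0.0) - pi[j])
        z_hat = sum(P[x] / q[x] for x in xs) / K     # line 7
        z_ma = (i * z_ma + z_hat) / (i + 1)          # line 8
        if z_ma > 0:                                 # guard against log(0)
            pi = softmax(theta)
            kl_pi = kl_estimate(P, q, pi, xs, z_ma)  # line 9
            kl_q = kl_estimate(P, q, q, xs, z_ma)    # line 10
            if kl_pi < kl_q:                         # line 11
                q = pi                               # line 12
    return softmax(theta), z_ma


# Hypothetical toy problem: P(x) = a(x) b(x), so Z = 0.2.
a = [0.5, 0.3, 0.15, 0.05]
b = [0.0, 0.0, 1.0, 1.0]
P = [ax * bx for ax, bx in zip(a, b)]
p = [Px / sum(P) for Px in P]
pi, z_ma = kl_adaptive_dpg(P, a)
print(exact_kl(p, a), exact_kl(p, pi), z_ma)
```

In this toy setting the exact divergence from $p$ is available, so one can check directly that the trained policy ends up closer to $p$ than the initial proposal and that the moving average tracks the true $Z$.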

C CAN STANDARD SUPERVISION FULLY SATISFY THE CONSTRAINTS?

In this section, we try to better understand the difficulties autoregressive models may have in fully satisfying constraints such as the ones illustrated in our pointwise experiments. To this end, we consider whether a standard fully supervised fine-tuning of GPT-2 can achieve that objective while keeping a minimal distance from the initial model. To answer the question, we carry out an experiment where we fine-tune GPT-2 on a collection of samples satisfying the desired constraint. Our goal here is to investigate whether GPT-2 can fully satisfy the constraint without overfitting the fine-tuning data, since overfitting (memorizing) the training data basically means high KL divergence from the initial model.

For this experiment, we choose a single-word constraint with the word "amazing". We start by sampling 1M sequences from GPT-2 small (a process that took us roughly 48 hours) and keeping only the ones containing "amazing" (this filtration process can be seen as a variant of rejection sampling (Casella et al., 2004)). We end up with a total of 4600 samples, out of which we use 500 for validation and the rest for fine-tuning. Figure 7 shows the evolution of both validation loss and constraint satisfaction $\mathbb{E}\phi(x)$ on samples generated from the model during fine-tuning. Interestingly, the lowest validation loss corresponds to only $\mathbb{E}\phi(x) \approx 0.56$. Higher values of $\mathbb{E}\phi(x)$ correspond to higher validation loss, i.e. to overfitting. This result suggests a relationship between training a policy that reaches 100% constraint satisfaction and overfitting the training data. This hints at the difficulty of strictly imposing certain types of constraints on pre-trained language models without moving far away from the initial model.

Figure 7: Supervised experiment when fine-tuning GPT-2 on a corpus of sentences containing the word "amazing". Left: validation loss development during fine-tuning. Right: percentage of samples generated using the fine-tuned model and containing the word "amazing". Here, the best model according to the validation loss is only able to achieve $\mathbb{E}\phi(x) = 0.5625$. Higher values of $\mathbb{E}\phi(x)$ tend to occur with higher validation loss, i.e. when overfitting.

D MORE COMPARISONS

D.1 ILLUSTRATION COMPARING GDC, REINFORCE, AND ZIEGLER

The figure below illustrates the difference between GDC and the RL-based REINFORCE and ZIEGLER baselines for a pointwise constraint. The main points to note are: (1) REINFORCE is trying to find a distribution $p_R$ maximizing $r(x)$ (meaning that $p_R$ lies on the $C$ manifold), but this $p_R$ is free to land anywhere on this manifold, and (2) ZIEGLER is trying to find a distribution $p_Z$ that interpolates (with a weight $\beta$) between a high average $r(x)$ and the KL divergence from $a$; unless $\beta = 0$, in which case we are back to REINFORCE, $p_Z$ does not satisfy the constraint and falls outside of the manifold.

The curved lines represent increasing levels of the KL divergence $D_{KL}(q, a)$. According to REINFORCE, any distribution $p_R$ such that $\mathbb{E}_{x \sim p_R}\, r(x) = 1$, that is, any distribution on $C$, is optimal. According to ZIEGLER, to each temperature $\beta > 0$ is associated an optimal distribution $p_Z = \arg\min_q\; \beta D_{KL}(q, a) - \mathbb{E}_{x \sim q}\, r(x)$, which does not directly lie on $C$. This is because, as indicated in (Ziegler et al., 2019), this distribution is of the form $p_Z(x) \propto a(x)\,e^{r(x)/\beta}$, giving positive probability to all $x$'s in the support of $a$, including points not lying on $C$. Our own optimal $p$ does lie on $C$ by definition, while minimizing the KL divergence from $a$.
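The contrast between the three optima can be checked numerically on a toy example. The sketch below assumes a hypothetical initial distribution `a` over four outcomes and a binary reward `r`; it only uses the closed forms quoted above ($p_Z(x) \propto a(x)\,e^{r(x)/\beta}$ and, for a pointwise constraint, $p(x) \propto a(x)\,b(x)$ with $b = r$).

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def expectation(dist, f):
    """Expected value of f under dist."""
    return sum(d * fx for d, fx in zip(dist, f))


def kl(p, q):
    """KL(p || q) on a finite space (terms with p(x) = 0 contribute 0)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def ziegler_optimum(a, r, beta):
    """p_Z(x) proportional to a(x) * exp(r(x) / beta)."""
    return normalize([ax * math.exp(rx / beta) for ax, rx in zip(a, r)])


def gdc_pointwise_optimum(a, r):
    """p(x) proportional to a(x) * b(x), with b = r a binary constraint."""
    return normalize([ax * rx for ax, rx in zip(a, r)])


# Hypothetical toy setting: four outcomes, the last two satisfy the constraint.
a = [0.5, 0.3, 0.15, 0.05]
r = [0.0, 0.0, 1.0, 1.0]

p_gdc = gdc_pointwise_optimum(a, r)         # lies on C and is closest to a
p_z = ziegler_optimum(a, r, beta=1.0)       # keeps mass outside C
p_other = [0.0, 0.0, 0.5, 0.5]              # also on C, so also "optimal" for REINFORCE

print(expectation(p_gdc, r), expectation(p_z, r))
print(kl(p_gdc, a), kl(p_other, a))
```

Both `p_gdc` and `p_other` satisfy the constraint fully, but only `p_gdc` minimizes the divergence from `a`; `p_z` approaches the constraint only as `beta` shrinks.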

D.2 COMPARISON AGAINST FURTHER BASELINES

Here we compare GDC to other baselines, namely Plug and Play (PPLM) (Dathathri et al., 2020) and CTRL (Keskar et al., 2019), for sentiment control. PPLM works by updating the hidden states of GPT-2 for a given prefix in order to steer the generation towards the desired attributes. Unlike GDC, PPLM needs a prefix to perform its hidden-state updates. Thus, our approach is more general in the sense that any prefix can be used on the trained model at test time, rather than requiring prefix-specific fine-tuning. CTRL is a large-scale language model (1.63 billion parameters, ~14x larger than GPT-2 small) based on control codes for steering text style and content. For the purpose of generating positive/negative sentiments using CTRL, we use its positive/negative reviews control codes as done in (Dathathri et al., 2020). The control codes used are "Reviews Rating: 5.0" and "Reviews Rating: 1.0" for positive and negative sentiment control, respectively. We use five different prefixes (or prompts) and generate 100 continuations given each prefix, obtaining a total of 500 samples. It is worth noting that GDC is trained in the same way as described in the main text, i.e. without any knowledge of prefixes, and that we only use prefixes at test time with the saved checkpoint. The five prefixes used come from (Dathathri et al., 2020): "The chicken ", "The potato ", "The lake ", "The pizza ", and "The horse ". We use the same sampling parameters across all approaches by setting the temperature T = 1.0, using top-k sampling with k = 10, and removing the repetition penalty used in CTRL (Keskar et al., 2019). However, we notice that CTRL does not work well with higher T values (apparent in the samples in Table 3); we therefore also report a CTRL evaluation with a lower temperature T = 0.5 and a repetition penalty $\lambda_{rep} = 1.2$, as reported in their paper.
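For readers unfamiliar with the shared decoding setting, a minimal sketch of a single top-k sampling step with temperature follows. It works on a plain list of next-token scores; the function name and the toy `logits` are hypothetical, and the repetition penalty used for CTRL* is not modelled.

```python
import math
import random


def top_k_sample(logits, k=10, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits after temperature scaling."""
    scaled = [value / temperature for value in logits]
    # ids of the k largest scaled logits, best first
    top_ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = scaled[top_ids[0]]
    # softmax restricted to the top-k ids (shifted by the peak for stability)
    weights = [math.exp(scaled[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Hypothetical scores for a 50-token vocabulary.
logits = [0.1 * i for i in range(50)]
token_id = top_k_sample(logits, k=10, temperature=1.0, rng=random.Random(0))
```

Lowering the temperature sharpens the distribution over the same k candidates, which is the difference between the CTRL and CTRL* settings as far as sampling is concerned.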
As metrics, we use the sentiment class expectation $\mathbb{E}\phi(x)$, the perplexity according to an external GPT-2 small architecture as in (Li et al., 2018), and the diversity metrics introduced in §3.1. We average all these metrics across the 500 continuations generated. Table 3 shows the results for positive and negative sentiment control experiments. As shown, GDC is able to achieve better positive/negative sentiment with lower perplexity than both PPLM and CTRL. As for diversity, GDC achieves comparable diversity to the other two approaches and even outperforms PPLM on the Dist-n metrics in the positive sentiment task.

Table 4 shows sample continuations from all three approaches. Clearly, PPLM and CTRL exhibit some form of degeneration and repetition in many of the continuations (highlighted in light red), which is reflected in their very high perplexity scores compared to GDC, which produces much more natural text with minimal repetition, without requiring a repetition penalty as CTRL does. It is also worth noting here that CTRL (and other control-code methods) is quite limited in terms of its applications. For instance, to generate positive/negative sentiment text as we do in this experiment, we are required to use the "Reviews Rating..." control code; using control codes outside of those CTRL was fine-tuned on leads to very bad generations. This, in turn, restricts the generated text to positive/negative reviews, although we may desire different types of positive/negative text (e.g. news reports). We can observe this effect in some of the samples in Table 4 such as "The chicken we just ordered from Amazon.com..." and "The pizza works no matter what settings you use it on."

Table 3: Comparison against PPLM (Dathathri et al., 2020) and CTRL (Keskar et al., 2019) on positive and negative sentiment control. We generate 100 samples for each prefix, obtaining a total of 500 samples. All metrics shown are averaged across the 500 samples obtained. CTRL refers to the shared setting across all approaches with temperature T = 1.0 and repetition penalty $\lambda_{rep} = 1.0$, and CTRL* refers to having T = 0.5 and $\lambda_{rep} = 1.2$. Here, we see a clear advantage of GDC in terms of constraint satisfaction and perplexity, and a comparable performance in terms of diversity against PPLM and CTRL.

Method | $\mathbb{E}\phi(x)$ ↑ | Perplexity ↓ | Dist-1 ↑ | Dist-2 ↑ | Dist-3 ↑ | SB-3 ↓ | SB-4 ↓ | SB-

Positive

The chicken is so tasty! This recipe has been in my freezer for about 2 months now. I have always had good quality chicken breasts, so I had a huge amount of chicken and
The potato is my favourite part of the recipe, and it is also my biggest problem. It is a good addition to anything you try to add to your meal. I love potato soup
The lake has been the most active place in the country for visitors since it opened last summer, and it's not hard to see why. The lake has been a place to stay for years,
The pizza place is great for a good time! They have all kinds of amazing sandwiches. My favorite sandwich is the "Chicken Pizza" which is the best I've ever had on my way to I don't think the pizza is better than any other pizza. It's not a
The horse is very well balanced with the horse's head and ears. It is a great horse to have for the family. The horse is quite large and the tail is long enough to accommodate the

Negative

The chicken is so bad it's literally going to die. It's like the dog is dying from a bad diet. I'm not even sure I want it in my kitchen"
The potato was really bad, but it's the worst potato that I've ever tasted. It was really hard for me to put my finger on. I was very disappointed with the flavor of the
The lake was just too cold for the water and the wind to blow. I couldn't get out of it. I couldn't see anything. The wind blew through my windshield and my car windshield"
The pizza is so bad that I've had to buy it again, even if I was just going to order one at a time. I'm not going to be buying one for the next week
The horse in the back row is not going to win the race. It is going to go up in flames, and the race will end in a heap of dust. The winner will be thrown

Positive

The chicken and the egg story: the extraordinary rise of the powerful man as he transforms the lives of ordinary people in a world in crisis The story of the A man dressed in a white suit
The potato. It is the world's most awesome people, and and and the that the the a the a the a the a the a the a the , "and the the
The lake is a great spot to enjoy the spectacular views of the Great Lakes. This is also a great place to take the children's swim. The lake is also a great place to hike in the beautiful mountains
The pizza is a delight! I have never had this before. I am a fan of Italian, and I have not had it before in the States. I will be back! It was a great experience
The horse is a powerful, beautiful, and extremely intelligent animal., (.C,:,.,.,../.../..',,' (., ".

Negative

The chickenpox epidemic of 1918-1920 in Britain was an acute and deadly disease that killed about 100,000 people worldwide, most of them infants. The 1918-1919 epidemic was caused by the
The potato is one of those things we all dream of. I think the most common thing that people come up with when I say I have the perfect one is the idea of a "salt water" version
The lake is one one one. <endoftext>The United Nations (UN) and the European Union (EU) are among a number of the world's most in the state and,, on the House vote for
The pizza crust is anvil, which is what the British have for a long time. The British Empire, the French, the the the the the a in the that is a a it is called and it
The horse is in the saddle. That's how he's been for the last four years. The Tampa Bay Lightning won a series of three games in a row to begin the new year and into January we were

Positive

The lake I am looking forward to seeing in September! The sea scene alone would have me watching again! Rating: 5.0 One of the best comedies I've seen. We will definitely watch it again. Smart and funny
The horse for this ones lines is:"The road to Hell is paved with good intentions. All roads to Hell end in Hell themselves." Rating: 5.0 I live in a small
The potato were "seeded" during a European settlement. What the characters have gone through is inevitable, but extremely rare. (And the potato has the honor of being the world's oldest potato. For that honor, we have a nickname: "@@
The chicken we just ordered from Amazon.com has not yet arrived and I am EXTREMELY EXCITED! The seller has the finest poultry in the market....plus, it is DELICIOUS!Thank you so
The pizza has been around for decades. Now that time has been added to it, all of us can appreciate it better, and enjoy it the way we have always enjoyed.PERFECT Pie:(The second listen) And it

Negative

The pizza works no matter what settings you use it on. The icecream maker always leaks out around the spout and onto the base (gross) -finally stopped working. I only wish I had spent more for a
The horse can not be found. Characters whose names show up in the battle screen:EXE: SRMX&OY; SQX the knight ¿QWOKB SKOZY the warrior!A useful upgrade for a
The lake has been made, but it's far from Earth 5. The ship has disappeared but they continue to radio.Ignoring the plot, which the Star Trek series never bothered with, Spock says that "we should have followed up. There is
The chicken died on me after 8 months. I don't think the unit is compatible with young chickens. Not recommended. Rating: 1.0 the plates didn't last long enough for me.I bought two of these plates and they
The potato does not start from eggplants, it starts from the start of generation! How stupid is that! :( I bought this and many others to try with my toddler for his preschool class. I want him to get

E RELATED WORK EXTENDED

Optimizing global rewards for Text Generation. There is a large reinforcement-learning-inspired literature about steering an autoregressive sequential model towards optimizing some global reward over the generated text. This includes REINFORCE (Williams, 1992a) for Machine Translation (MT) (Ranzato et al., 2016), actor critic for Abstractive Summarization (Paulus et al., 2018), Image-to-Text (Liu et al., 2016b), Dialogue Generation (Li et al., 2016b), and Video Captioning (Pasunuru & Bansal, 2017). With respect to rewards, some approaches for Machine Translation and Summarization (Ranzato et al., 2016; Bahdanau et al., 2017) directly optimize end task rewards such as BLEU and ROUGE at training time to compensate for the mismatch between the perplexity-based training of the initial model and the evaluation metrics used at test time. Some others use heuristic rewards as in (Li et al., 2016b; Tambwekar et al., 2019)

Competing Degeneration in Controlled Text Generation. When using such approaches, one needs to take care not to forget too much of the original LM policy ("degeneration"): Liu et al.

KL Divergence penalty. Another approach relies on penalizing overly large deviations of the trained policy relative to the original policy. Jaques et al. (2017; 2019) propose a conservative fine-tuning approach with a KL penalty between the trained policy and the original autoregressive model. This penalty acts as a regularizer on the optimization process that prevents the trained policy from deviating too much from the original policy. Ziegler et al. (2019) follow a similar approach for fine-tuning a language model based on human preferences; in this case a proximal policy algorithm (Schulman et al., 2017) is used to maximize the combined reward. PPLM (Dathathri et al., 2020), this time in a plug-and-play rather than a fine-tuning context, also uses KL divergence to penalize deviations from the initial policy.

Pointwise vs. Distributional View. Most of the existing works on Controlled Generation have taken what we have called a pointwise view: focusing on the quality of each individual output, as opposed to distributional properties of the collection of all outputs. And in fact, the standard objective of RL is to optimize a pointwise reward. Even when policy-gradient methods do consider distributions over outputs, they only do so as a tool towards producing maximal rewards; and in fact, it is a side effect of the limited capacity of the policy networks that such distributions do not peak on a single output, as would be the optimal outcome in cases of real-valued rewards with no ties. By contrast to this usual optimization "intent", our own intent here is explicitly distributional, and the policies we are looking for are not simply tools towards maximizing scores, but actual objectives in their own right. Such a change of perspective might be argued against in the case of conditional seq2seq problems, such as Machine Translation, where focusing on a single good output for a given input makes sense, but the pointwise view is clearly ill-suited when focusing on language models, where sample diversity is a requirement.

Energy Based Models for Text. Energy-Based Models (EBMs) (Hinton, 2002; LeCun et al., 2006; Ranzato et al., 2007) are learning frameworks that attracted a lot of attention several decades ago. There has been a recent surge of interest in these types of models across a variety of fields. Some early NLP-related EBM research is concerned with neural-based sequence labelling problems (e.g. tagging) exploiting the global sequence (Andor et al., 2016; Belanger & McCallum, 2016). Some current applications to text generation include Parshakova et al. (2019a) and Deng et al. (2020), who augment a standard autoregressive LM with an additional global factor in order to get a lower perplexity on the training data. Tu et al. (2020) propose an energy-based method to perform inference networks from pretrained Non-Autoregressive Machine Translation models. A recent survey of EBMs for text is provided in Bakhtin et al. (2020).

F HYPERPARAMETERS AND TRAINING DETAILS

We implement GDC and all baselines using the PyTorch framework (Paszke et al., 2019). For all experiments we start from a pretrained GPT-2 small (117M parameters) obtained from the HuggingFace library (Wolf et al., 2019) and fine-tune for 3K gradient-update steps. Each training run required 2 Nvidia V100 GPUs; the longest model took ∼72 hours to train. A list of the hyperparameters used for GDC and baselines is given in Table 5. $K$ refers to the number of gradient steps per iteration in Algorithm 2. $N$ refers to the number of samples required and $\mu_{\text{tolerance}}$ to the minimum tolerated error $\|\bar{\mu} - \hat{\mu}(\lambda)\|_2^2$ while optimizing $\lambda$, and $\lambda_{\text{learning}}$ is the SGD step size for updating $\lambda$ in Algorithm 1.

During training of the policy $\pi_\theta$, we perform periodic evaluation as follows: every 10 minibatch gradient updates, we sample 2048 sequences of 40 tokens, using nucleus sampling with top-p = 0.9 (Holtzman et al., 2020), and estimate diversity metrics on these samples. On the other hand, for accurate estimations of $D_{KL}$-based metrics we perform pure sampling on another set of 2048 sequences of 40 tokens.

For word lists in the pointwise experiments in Section 3.2, we used the 4 word lists from the Plug and Play (Dathathri et al., 2020) repository. As for the sentiment and clickbait classifiers, we used their pre-trained classifier heads over GPT-2 medium.

For distributional and hybrid experiments, we fine-tune GPT-2 small (117M parameters) to produce biographies on a dataset of 700K Wikipedia biographies (Lebret et al., 2016), which we refer to as GPT-2 bio. To detect whether a given text is a female biography, we construct $\phi_{female}(x)$ as a simple rule-based discriminator that depends on the percentage of female personal pronouns (she, her, hers, herself) w.r.t. all mentioned pronouns. We define four types of professions: "Art", "Science", "Business and Politics", and "Sports". To detect them, we define a word list for each type as shown in Table 6.
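A rule-based discriminator of the kind described for $\phi_{female}(x)$ could look like the following sketch. Only the female pronoun list comes from the text; the male pronoun list, the restriction of "all mentioned pronouns" to gendered third-person pronouns, the tokenization, and the 0.5 threshold are all assumptions made for illustration.

```python
import re

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}   # list given in the text
MALE_PRONOUNS = {"he", "him", "his", "himself"}       # assumption: not given in the text


def female_pronoun_ratio(text):
    """Share of female pronouns among the gendered pronouns found in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    female = sum(token in FEMALE_PRONOUNS for token in tokens)
    male = sum(token in MALE_PRONOUNS for token in tokens)
    total = female + male
    return female / total if total else 0.0


def phi_female(text, threshold=0.5):
    """Binary feature: 1.0 if the ratio exceeds a (hypothetical) threshold."""
    return 1.0 if female_pronoun_ratio(text) > threshold else 0.0


print(phi_female("She wrote her first novel in 1990."))
print(phi_female("He said she admired him and his work."))
```

A text with no gendered pronouns is scored 0.0 here; that is a design choice of this sketch, not something stated in the paper.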
(Sheng et al., 2019b; Brown et al., 2020b; Nadeem et al., 2020). This shows that bias in LMs also shows up in forms other than under-representation, and the task of debiasing LMs could require a more complex control method. GPT-2 bio demonstrates a large initial bias: over a large sample of 20480 examples using top-p sampling (p = 0.9), it generates only around 7% female biographies, with a large imbalance between the profession types "Science" (1%), "Art" (10%), "Business&Politics" (10%) and "Sports" (20%). In this set of experiments, we demonstrate the potential of GDC as a flexible general framework that can control pretrained Language Models to impose pointwise constraints, distributional constraints, or even a mix of the two (hybrid constraints). We design a set of 6 experiments whose descriptions and results are displayed in the figures below. Generation examples are provided in Table 7.

[Figure 9 plot. Panel: Female (desired = 0.5). Legend: GDC, Desired.]

Figure 9: Exp1: Single Distributional Constraint. Balancing demographics can be represented easily through distributional constraints. By using a constraint such as $\mathbb{E}_{x \sim p}\,\phi_{female}(x) = 0.5$, we can target balancing the female biographies in the distribution of all generations. Note that a pointwise objective $\mathbb{E}_{x \sim p}\,\phi_{female}(x) = 1.0$ would maximize the presence of female biographies at the expense of other demographics, inducing bias in the opposite direction. The plot shows how $\mathbb{E}_{x \sim p}\,\phi_{female}(x)$ evolves towards the defined expectation: GDC is able to reduce the bias of GPT-2 bio to obtain 36.7% female biographies rather than just 7%.
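To see what such a distributional constraint implies for the EBM, consider the exponential form of equation (2) with a single binary feature, $P(x) = a(x)\,e^{\lambda \phi(x)}$. In that special case the $\lambda$ that meets a target expectation has a closed form. The sketch below is only a toy calculation under that assumption; it takes the roughly 7% base rate quoted above as exact and is not how $\lambda$ is estimated in the experiments, which rely on sampling.

```python
import math


def lambda_for_binary_feature(base_rate, target):
    """Lambda such that E_p[phi] = target when E_a[phi] = base_rate and phi is binary."""
    return math.log(target * (1.0 - base_rate) / ((1.0 - target) * base_rate))


def expected_feature(base_rate, lam):
    """E_p[phi] under p(x) proportional to a(x) * exp(lam * phi(x))."""
    mass_on = base_rate * math.exp(lam)
    return mass_on / (mass_on + (1.0 - base_rate))


lam = lambda_for_binary_feature(base_rate=0.07, target=0.5)
print(lam)                          # approximately 2.59
print(expected_feature(0.07, lam))  # recovers the 0.5 target up to rounding
```

With $\lambda = 0$ the expectation stays at the base rate, and letting $\lambda$ grow without bound pushes it towards 1, which corresponds to the pointwise objective mentioned in the caption.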

H.4 TOKEN FREQUENCY ANALYSIS

To analyse in depth the effect of deviating far from the original GPT-2, for policies obtained from our method and each baseline, we obtain a large sample and filter it down to 4000 sequences that satisfy the imposed pointwise constraints, for each of the 17 pointwise experiments explained in §3.

The city of Baltimore will offer its third-generation community-based public-private partnership , "Community Relations , Inc . , " to build more than 1 , 0 Greece . The eurozone-wide unemployment rate plunged to 1 . 3 percent in June and remains below the EU average of 2 . 4 percent 0 Winnipeg Jets Injury Update : RW RW Blake Wheeler Winnipeg Jets Injury Update : RW RW Blake Wheeler Tampa Bay Lightning In 0 "We know that if there's a way out of these problems , it's not by having a single one of them , " he says 0 1 Clean Episode #2 --Sledgehammer 5 : The Longest War in the World! In this special episode , the Sledgehammer 5 team discusses their 0 A man who took a photograph of a police officer wearing a bulletproof vest and said it was him was charged with assault causing bodily 0 In a very big way , I like this book . The only difference here is that I got an amazing story from Jack . 0 I think we should be building the same thing for everyone . A shared economy that creates jobs and a shared supply of energy . Ziegler 0 "There is no way I can do that . And that's not a small thing , " he told the Guardian . "So I have 0 . The first person I ever spoke with about it is a big fan . "I thought it was pretty cool . I love everything 0 This is an easy tutorial to get started with the Django application . Once you understand how the Django application is implemented , you can 0 When you're a student with one of the most popular online courses available , you may find it easy to fall in love with what 0 BRAINSTOCK The UK could be on the cusp of becoming the first in the world to have its own free market .
Bobby Bould 0 "We have a lot of good options that will enable our employees to compete better , improve our efficiency and create more value for the 0 "That was like a lot of good times to me . " He says . The group of five men in their late 30s went 0 You can view all posts of this blog here I got an e-mail from a couple of folks that we found interesting and amusing . They asked if I could have an idea of 1 The "Black Friday" holiday has some amusing details about the price of goods on Thanksgiving weekend , and they are included in the holiday's list 1 "It was amusing and very amusing for all of us to witness , " he said . "But it also was not a good time 1 My favorite game of all time . It was a real fun way to play with your friends . This game was one of my You can see , the whole point of this post is to get back to the "What is it all about ? " point . 1 "You know , it's all that has happened in a couple of weeks in the last two weeks , " said Smith . "It's amusing 1 Consequences of the War . I will not answer any questions . However it is amusing to see how many "fancy" books have been published 1 In fact , I'd say that this game is the closest thing I've ever seen to the real life story of the main characters . 1 The only thing more amusing , however , was to see how it went down . The last person who ever read this piece would 1 It may be an amusing fact that the American Society of Pediatricians and Surgeons does not endorse circumcision . However , it is actually the 1 Cannot be created with your username Cannot be created with your username Cannot be created with your username Cannot be created with your username Can't The " Paris Commune" has been a long and painful experience for many of the thousands of workers who marched for a better world . The The Paris attacks claimed the lives of 20 people in a day and left over 4 , 400 injured , the authorities said . 
The 1 In Paris , a major tourist attraction in the Middle East with a long history of terrorist attacks , the Charlie Hebdo massacre and the 1 As the Paris attack unfolded , the European Union and the U . S . took to Twitter to describe the attack . A tweet 1 The Paris massacre in November 2012 was carried out under a pretext of preventing terrorism . But on this basis , the attackers knew nothing 1 In Paris on Monday , a delegation of 50 members of the European Commission was set to discuss the issue of the EU's plan to 1 In his Paris address , President Hollande pledged to work with France to fight "the scourge of terrorism . " On Sunday , in a 1 A man who allegedly attacked a girl in Paris was sentenced to 15 years to life in prison for killing three children in 2012 , 1 Cairo , July 18 -The Paris terrorist attacks , which killed 14 people , killed 16 , wounded 13 more and left a third

Table 11: Randomly selected generations from the single-word constraint task for the word "Paris" (with occurrence probability $1/10^3$), highlighted in green. Tokens are highlighted in yellow with different intensities to indicate their overall frequencies in the generated corpus. $\phi(x) = 1$ indicates the satisfaction of the constraint in the sample, and reps the number of its repetitions across all generations.

reps φ(x) Generation
GDC 1 In 2014 , in an attempt to stop the restaurant industry from becoming a "corporate welfare racket" for the masses , the city of San 1 A New Jersey man was arrested early Thursday morning on suspicion of possessing a gun and was placed under investigation by the police department , 1 SINGAPORE -A sushi restaurant owner has been jailed for 10 years for allegedly stealing money from a customer during the summer .
A witness 1 The restaurant 's owner , James Saito , was suspended without pay last month after he said he accidentally broke the glass in front of a 1 A local restaurant chain on Monday announced its intention to offer a variety of meals and snacks to customers in the form of ice cream 1 I've never been in a restaurant before , but the atmosphere at the restaurant was very different than I remembered . And with only a 1 Watchers was founded in 1993 by a restaurant co-owner who wanted a place that had a true Southern feel . The restaurant opened on June 1 A restaurant in the heart of the San Antonio area has been turned into an art gallery by a local entrepreneur . Carnal Cafe , REINFORCE 1 The best Mexican restaurant Italian restaurant that has Italian restaurant that famous Italian Italian restaurant that famous Mexican restaurant restaurant that famous Italian restaurant that 1 The most expensive Italian pizza restaurant chain restaurant chain restaurant -free to right-old restaurant -hot-free pizza Italian pizza restaurant buti fast-food restaurant -street restaurant -dent meal 1 The first American restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant . 
The first restaurant chain restaurant chain 1 2 chicken Italian pizza restaurant -Mexican Italian pizza -pizza restaurant -Italian Italian pizza restaurant -Mexican Mexican Italian pizza restaurant -Mexican 1 Kud -a Italian burger restaurant chain restaurant that chain restaurant restaurant -chain restaurant -Italian pizza restaurant -Mexican restaurant -chain restaurant 1 The Red Lob Taco restaurant restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain restaurant chain 1 4-pic pizza restaurant place pizza restaurant in a Italian restaurant restaurant restaurant chain restaurant that chain restaurant chain restaurant that right away Italian pizza restaurant 1 Finesse Italian Italian food-free pizza restaurant -dairy-free pizza restaurant -pizzic -Italian food pizza restaurant -Mexican pizza - The restaurant in San Antonio , Texas is known for a "Southern Texas food" philosophy that has given it its name , according to the 1 We've had a lot of success with this , and a lot of great things . There's this restaurant . We were all over it 1 I'm really pleased with my purchase! The menu was the same with a lot of restaurant options and I couldn't say enough good things about 1 "I wanted to bring this restaurant to town , " said Jim Dorn , who manages the restaurant 's business department . "I knew we were 1 The world's oldest restaurant chain , the Cinco de Mayo , offers a mix of comfort food and classic Southern hospitality with its iconic Italian 1 Saucer has been offering the restaurant the chance to offer a one-hour service for all its guests , but not necessarily at a premium . 
1 SALT LAKE CITY -Three Utah restaurant owners have filed suit to force restaurant owner Jimmy Denny to close after his company failed to report 1 Fellow restaurant owners , remember that while every once in a while a friend invites you to his or her own restaurant , you never Table 12 : Randomly selected generations from the single-word constraint task for the word "restaurant" (with occurrence probability 1/10 3 ) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations. reps φ(x) Generation GDC 1 We are doing this in collaboration with you! We've done amazing work to make Minecraft an amazing game . However , in the past , 1 This game is amazing ! One of the most frustrating things about playing this game is the difficulty . There is no leveling system , and 1 A team of Japanese scientists has found that the world's largest nuclear plant could be a disaster waiting to happen . "This amazing discovery reveals 1 So there we were , looking at a gorgeous game . That was something I enjoyed when I played a bit of a Zelda , 1 I just found out about this and am super excited to get it for you guys! Its amazing how many great games I can find 1 Thanks to amazing support , you have had access to this content for years , but have it been delivered to you in the form 1 What an amazing time to be a professional football fan! The fans of Minnesota have a great time . I love the city , the Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . We't do our share of 11 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . 
We're sure John would have 1 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . We're also pretty sure John 18 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . We're sure John and John 2 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . We't get enough of the Ziegler 1 We need to make sure that this type of work will be shared . The amazing and talented team at Google has just announced a 1 I've been waiting for this amazing piece of artwork since I heard of it on the New York Times' "The Art of Comic-Con 2012" podcast 1 I love this site because I'm trying to find the right answers to every question I have as a designer . It's amazing how much 1 The New York Times is going to be out doing something amazing with its coverage of the presidential election . The paper is already releasing 1 You'll see a lot of amazing video games coming out of Sony's booth , all from Sony and Nintendo , in the coming months . 1 The New York City FC academy and its training group were both awarded two year contracts in 2014 with an amazing $2 . 5 million 1 My favorite part of this tutorial is when you watch it , the amazing detail of the line up . It's so fun to watch 1 You have amazing taste , can be enjoyed by yourself or others . Our wines are not for sale , so this is for you Table 13 : Randomly selected generations from the single-word constraint task for the word "amazing" (with occurrence probability 1/10 3 ) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations. 
Canada to start trading with the US , Canada is now considering a move towards becoming a trade partner with the US Canadian Prime Minister 1 In the U . S . , Canada , Australia and New Zealand are among the most-traveled countries in the world , and they have 1 The Federal Government is making changes to the Canada Revenue Agency (CRA) to make it easier for employers to pay their employees more . The 1 Canada 's public school system is struggling with its highest rate of student debt , and in recent years the province has been struggling to keep 1 In Canada , when I look at my family's wealth , my parents and my grandparents were still poor . The government is asking for $5 million from the Canada Revenue Agency , which is part of the agency , to conduct a study to 1 The federal government has released a $50 million grant for Canada 's private sector to work with local government , community groups , the arts and 1 The Canada Revenue Agency says the company is not responsible for the use of data provided by it or the people it is accessing in 1 Canada 's top diplomat has condemned the killing of an Afghan man during a recent airstrike on a refugee camp in Afghanistan . U . S 1 The government announced on Thursday it is looking at setting up a national database of people from around the world who've been detained in Canada 1 As the federal government tries to cut carbon emissions , Canada is struggling with rising fuel prices , which are likely to lead to reductions 1 Canada 's Foreign Affairs Minister Chrystia Freeland said Tuesday that it's important to work with other countries on combating terrorism , as well as Canada , Table 14 : Randomly selected generations from the single-word constraint task for the word "Canada" (with occurrence probability 1/10 3 ) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. 
φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations. The US has announced that it will launch a new drone war game for the game "Call of Duty : Black Ops 2 , " 1 A group of Chinese-Americans has sued US President Donald Trump in a bid to force him to pay their former student visa fees . Chinese 1 A U . S . Army soldier who was killed in Iraq is the second US soldier to be killed in the country since January 1 Haitian officials are trying to make sure the US forces who stormed Iraq will be held responsible for their actions . They want the US We all know that a lot of people don't love to live in poverty , or even know where to live , or even know 1 To view these statistics , click here . Team Totals How do you rate each team on this page ? We are a team with 1 To help you better understand how we can provide you with the best service for your business , we've created an interactive version of this 1 The "Saving Christmas" campaign has launched and you can make a donation here . I got these last year when they were $500 , but I didn't get a monster when they went out in 2012 , so this 1 A man who appeared in a video calling on supporters to be loyal to the Muslim faith is being attacked by an attacker who then 1 The ghost of her father is here , and it's time to get a ghost back . If she ever does return , she'll be 1 Fancy the way you play with a ghost of a game to get some new stuff ? Get it here! Check out the rest of 1 The American Red Cross is among the first to warn against the increasing prevalence of heart attacks among gay men and lesbians in a national 1 "The devil's still out there , " says the narrator , "the good man's not the only one to see his ghost . 
His ghost 1 The "Star Wars" horror series is getting a giant facelift for its third-season finale , with the show now featuring a giant , giant alien 1 As we've seen from the beginning of the Kickstarter , the concept for the new game was conceived by This is a great way to explore the life of this world . I was a very happy person , happy because I was the 1 I'll get into the beast of the bush in a bit , but in the last few minutes I've got a pretty good feel for 1 I am a big fan of the fantasy genre , but that is a topic for another time . I can tell you that I 1 In the years that followed , the Internet was transformed by the advent of the Internet in 1999 , with Facebook (FB) and Google (GOOGL) 1 A strange ghost is haunting the ruins of ancient Babylon . In one of those horror movies , a ghost is caught in a mysterious 1 "We're seeing that now in the case of Syria , " the judge said . "That's why the State of Canada should not take it 1 "The world should stop playing dead . The world should start playing alive . " That was the line of the voice that emerged from 1 I just wanted to try it out . I'm so excited about it and just started a new game , and it works . It's The government will give three days' notice to banks for taking off all their shares in the private sector , the prime minister said , 1 . (1) A person may , to the extent that the person believes that an action or proceeding will be taken against him or her 1 The In a major development in government's attempt to block further progress in the process of nationalisation of its commerce , the state government , in 1 The government may not prosecute a group of government-owned enterprises for its political , economic , or administrative purposes in its national economy . 
Article 1 The United States government has ordered a court order to enforce state laws or governmental power over the personal conduct of its political subdivision in 1 The government has ordered an order on its release of a dozen government ministers in attempts to block its operation in judicial proceedings in its 1 The state government's monopoly on its economic power over the political , economic , or administrative process in order of its citizens in order to 1 In its attempt to block access to the state government in its political action , government made an attempt to restrict economic activity in order 1 The government will invoke its powers against the government in court of India against its order seeking a order in its internal order in its 1 In its campaign against economic independence in its efforts to enforce an effective state monopoly on its political power in its state , the Government It has taken several years for the government to finally acknowledge the real issues facing the Australian population . This is because the most pressing 1 We had hoped that the election would be a simple one-sided affair between those who don't support the Republican Party and those who do . 1 The government of Saskatchewan has a long history of lobbying on behalf of business interests . The province recently passed an omnibus tax bill that 1 The NDP has taken the issue of whether the state has a "fundamental right" to free trade to the forefront in its annual platform , 1 By Steve Nelles More than two-thirds of Texans are expected to sign off on the state's future tax code in January , with a possible 1 An appeals court in Ohio ruled Monday that the state's refusal to allow a transgender employee to use the state bathroom of her choice violated 1 The government will set aside $2 . 4 billion to fund more than 800 schools in the South African state , including many in the Thank you for supporting the journalism that our community needs! 
For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much Ziegler 1 I really have to say this about the two albums that I've been getting : "Walking on Water" and "The Road . " They're both 4418 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 4418 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 4418 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 4418 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 4418 Thank you for supporting the journalism that our community needs! 
For unlimited access to the best local , national , and international news and much 3560 Be the first to know . No one covers what is happening in our community better than we do . And with a digital subscription 4418 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much "These are the kind of people we're going to have in our community for years to come , " said Donny , the father of 1 "A great book , " said Mr . Moore , who has been writing an introduction to the work . "But it is a wonderful 1 The great question of all time is "who would have guessed that this was so different and fun ? " This is the question I 1 "I'm a big fan of all kinds of things and I can say that I've always been an avid fan of everything . The team 1 Today , it's nice to be back in the game! I want to offer some great games to show your support for your favourite artists Our annual Taste and Taste brings together incredible culinary treats with wonderful ingredients to give us that we know we have , loved and enjoyed 1 Our special fundraiser to welcome our wonderful friend , The Red Queen is hosting a celebration and honor this wonderful gem is all deserves is 1 Our unique and eclectic evening celebrates our love for love has inspired us this year to share the joy and joy our little ones have 1 Our Mission at the Great Black History & Cultural Center celebrates the true story of our great African American has brought together a creative exploration 1 Our Mission is bringing together events and fun events that bring together a truly unique gift with this wonderful event brings together such amazing people REINFORCE P(x) 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! 
For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 10000 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much Ziegler 1238 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 1 Our team has long supported the idea of using your knowledge and talents to make a more efficient , effective and sustainable way of making 1238 Thank you for supporting the journalism that our community needs! For unlimited access to the best local , national , and international news and much 1 The 2017 Season is about to roll out a big , fun , and exciting new lineup with the addition of a very special guest 1 "I'm happy that he took his time and let everyone know that I'm going to take the same steps as everyone else with the same 1 This is a great day for those who love art , poetry , and the world to get together and have a great time . 1 Gather up the best and best food at an affordable price . 
We offer a wide selection of vegan and vegetarian options and all our 1 The latest in our series of guides for working with digital artisans . We offer a number of free tools , including Photoshop and Illustrator Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . 10000 Problem with the adblockers fixed! Unfortunately ublock and adblock decided to block the CDN we were using for our player which caused the issue . Ziegler 1 "I've never experienced anything like this , " he says . "I've never felt so terrible about myself and the world . This is such 1 "You don't need a damn damn dime to buy a fucking computer . It's not even worth a dime . If you can get a 1 "I think you guys do everything you can to get us back into the playoffs , " Porzingis said . "We're just trying to stay 1 I've been reading an overwhelming amount of books on how to clean up your house for several years now . This is not just a 1 I've never seen a better show for the price . 
Not even a week ago , I saw some terrible TV , including the worst 1 I know that I can't believe you're going to have to wait so long to write a big script in HTML , and it's already 1 It has come to my attention that someone has gone overboard on some comments that I have heard of . I can't believe it's been 1 'I don't want to do this' 'No , I'm not going to do it , ' he says . 'I'm going to work hard , 



Additional Related Work is provided in §E. We use §A, §B ... to refer to sections in the Appendix.
One possible sampling approach here would be to employ MCMC techniques, such as Metropolis-Hastings (Robert & Casella, 2005). These come with theoretical convergence guarantees in the limit, but in practice convergence can be very difficult to assess, and furthermore, obtaining samples can be extremely slow.
This example uses only binary features, but real-valued features can also be used, for instance scores returned by a soft classifier.
Boldface φ and µ represent vectors of real values (features and moments).
The difference with REINFORCE makes sense if one observes that φ(x) can be maximized on many sequences, while P(x) tries to maximize a(x) · φ(x), which is typically maximized on only one sequence.
raised by an anonymous reviewer of our ICLR submission.
Both metrics are equal to 0 only if the distributions are equal everywhere (in the case of discrete distributions, which are our focus here; otherwise almost everywhere). To our knowledge, there is no obvious best metric to use when assessing a proposal in importance sampling, which led us to conduct ablation experiments with both metrics (Appendix 2).
Note how very difficult the job would be in the extreme case where the constraint was based on a hash-based predicate filtering on average one sentence out of two.
With lower temperatures, this behaviour becomes even worse and CTRL mostly generates reviews.
In which cases the distribution q maximizing Ex∼q R(x) would be q = δx* for x* = arg max x R(x).
The early work on "Whole sentence exponential models" by Rosenfeld et al. (2001), which only came to our attention when preparing the final version of this paper, can be considered as a form of EBM over texts. While it does not utilize neural networks, it does exploit, as we do, the exponential family in order to provide a global form of control over texts.



Figure 1: From MaxEnt to EBM through Information Geometry. The Generalized MaxEnt specification (left panel) looks for a distribution p that lies on the moment-constraint manifold C and that minimizes the forward KL divergence DKL(p||a). The solution is provided by Information Geometry: (1) build the exponential family E determined by a and φ; (2) p lies at the intersection of C and E; (3) for any distribution c satisfying the constraints, the "Pythagorean identity" holds: DKL(c||a) = DKL(c||p) + DKL(p||a); in particular, p is unique.
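The Pythagorean identity in the caption can be checked numerically on a toy case. The sketch below assumes a hypothetical four-outcome base distribution `a` and a pointwise constraint that keeps only the first two outcomes, so that p is simply `a` restricted to those outcomes and renormalized. All names and numbers are illustrative and do not come from the paper.

```python
import math

def kl(q, r):
    """D_KL(q || r) for discrete distributions given as lists; 0 * log 0 is treated as 0."""
    return sum(qi * math.log(qi / ri) for qi, ri in zip(q, r) if qi > 0)

a = [0.1, 0.3, 0.4, 0.2]   # toy base distribution (hypothetical values)
allowed = [1, 1, 0, 0]     # pointwise constraint: only outcomes 0 and 1 are acceptable

z = sum(ai * bi for ai, bi in zip(a, allowed))     # mass that a puts on the constraint set
p = [ai * bi / z for ai, bi in zip(a, allowed)]    # a restricted to the set, renormalized

c = [0.7, 0.3, 0.0, 0.0]   # some other distribution satisfying the constraint

lhs = kl(c, a)
rhs = kl(c, p) + kl(p, a)
print(lhs, rhs)            # the two sides agree up to floating-point error
```

In this pointwise case DKL(p||a) reduces to -log Z, where Z is the mass that a assigns to the constraint set, which is why the decomposition holds for every c supported on that set.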

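$\hat{\boldsymbol{\mu}}(\lambda) = \dfrac{\sum_{j=1}^{N} w_j(\lambda)\, \boldsymbol{\phi}(x_j)}{\sum_{j=1}^{N} w_j(\lambda)}$
4: solve by SGD: $\arg\min_{\lambda} \lVert \bar{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}(\lambda) \rVert_2^2$
Output: parameter vector $\lambda$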
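The last steps of this algorithm can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that the samples x_j are drawn from the base model a, that the weights take the exponential-family form w_j(λ) = exp(<λ, φ(x_j)>), and it replaces SGD with full-batch gradient descent using the analytic gradient. The function names `snis_moments` and `fit_lambda` are hypothetical.

```python
import math

def snis_moments(lam, feats):
    """Self-normalized estimate mu_hat(lambda) = sum_j w_j phi(x_j) / sum_j w_j.

    Assumes w_j(lambda) = exp(<lambda, phi(x_j)>) for samples x_j drawn from a.
    """
    logits = [sum(l * f for l, f in zip(lam, phi)) for phi in feats]
    top = max(logits)                          # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    w = [wj / total for wj in w]               # normalized weights
    mu_hat = [sum(wj * phi[i] for wj, phi in zip(w, feats)) for i in range(len(lam))]
    return mu_hat, w

def fit_lambda(feats, mu_bar, lr=0.5, steps=500):
    """Minimize ||mu_bar - mu_hat(lambda)||^2 by full-batch gradient descent."""
    k = len(mu_bar)
    lam = [0.0] * k
    for _ in range(steps):
        mu_hat, w = snis_moments(lam, feats)
        diff = [mu_bar[i] - mu_hat[i] for i in range(k)]
        # The Jacobian of mu_hat w.r.t. lambda is the weighted covariance of phi.
        cov = [[sum(wj * (phi[i] - mu_hat[i]) * (phi[j] - mu_hat[j])
                    for wj, phi in zip(w, feats)) for j in range(k)] for i in range(k)]
        for i in range(k):
            lam[i] += lr * 2.0 * sum(cov[i][j] * diff[j] for j in range(k))
    return lam

# Toy data: one binary feature that holds for 10% of the samples; target moment 0.5.
feats = [[1.0]] * 100 + [[0.0]] * 900
lam = fit_lambda(feats, [0.5])
mu_hat, _ = snis_moments(lam, feats)
print(lam, mu_hat)
```

For a single binary feature with base rate 0.1 and target 0.5, the fitted λ should approach log 9, since the reweighted moment is 0.1·e^λ / (0.1·e^λ + 0.9).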

Figure 2: Evaluation metrics Eφ(s), DKL(πθ||a) (↓ better), Self-BLEU-5 (↓ better), and Distinct-1 (↑ better), aggregated across 17 pointwise experiments (single words, word lists, discriminators) and computed every 10 gradient updates, for policies obtained from GDC and from three training baselines: REINFORCE, REINFORCE P(x), and ZIEGLER. See Appendix H for a detailed view of each experiment and more evaluation metrics.
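As a rough illustration of the Distinct-1 metric named in the caption, the sketch below computes Distinct-n as the number of distinct n-grams divided by the total number of n-grams over a set of samples. This is one common definition and is assumed here. The whitespace tokenization and the function name `distinct_n` are also assumptions, not the paper's evaluation code.

```python
def distinct_n(samples, n=1):
    """Ratio of distinct n-grams to total n-grams over whitespace-tokenized samples."""
    total = 0
    seen = set()
    for text in samples:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0

# A degenerate set (one repeated token) versus a varied set.
degenerate = ["Paris Paris Paris Paris", "Paris Paris Paris Paris"]
varied = ["the Paris climate accord", "a restaurant in San Antonio"]
print(distinct_n(degenerate), distinct_n(varied))
```

Under this definition, a set of samples made of a single repeated token scores close to 0, while samples with no repeated tokens score 1.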

Figure 4: "Zipf-like" token frequency analysis on sets of 68000 generated samples from each method (only samples strictly satisfying the constraints are kept, for fair comparison). Longer tails mean a lower concentration of mass on the high frequency tokens, and therefore indicate more vocabulary richness. See Appendix H.4 for details.
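A token frequency analysis of this kind can be sketched in a few lines. The code below is an assumption-laden illustration: it uses whitespace tokenization and hypothetical helper names (`rank_frequency`, `head_mass`), and it does not reproduce the filtering or plotting used for the figure.

```python
from collections import Counter

def rank_frequency(samples):
    """Relative token frequencies, sorted from the most to the least frequent token."""
    counts = Counter(tok for text in samples for tok in text.split())
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]

def head_mass(freqs, k):
    """Probability mass carried by the k most frequent tokens."""
    return sum(freqs[:k])

samples = ["a a a b", "a c"]
freqs = rank_frequency(samples)
print(freqs, head_mass(freqs, 1))
```

Plotting `freqs` against rank on a log scale gives the "Zipf-like" curve. A larger `head_mass` for small `k` corresponds to a shorter tail, that is, to less vocabulary richness.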

Figure 5: Transitivity of Information Projection (aka Generalized MaxEnt).

Figure 8: Case of a pointwise binary requirement r(x) = 1: comparison with Reinforce and Ziegler. The manifold C is the set of distributions c s.t. c(x) > 0 → r(x) = 1, or, equivalently, s.t. Ex∼c r(x) = 1. The curved lines represent increasing levels of the KL divergence DKL(q, a). According to Reinforce, any distribution pR s.t. Ex∼pR r(x) = 1, that is, any distribution on C, is optimal. According to Ziegler, each temperature β > 0 is associated with an optimal distribution pZ = arg min q βDKL(q, a) - Ex∼q r(x), which in general does not lie on C: as indicated in Ziegler et al. (2019), this distribution is of the form pZ(x) ∝ a(x) e^{r(x)/β}, which gives positive probability to every x in the support of a, including points not lying on C. Our own optimal p does lie on C by definition, while minimizing the KL divergence from a.
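The contrast drawn in this caption can be made concrete on a toy discrete distribution. The sketch below uses the closed form pZ(x) ∝ a(x) e^{r(x)/β} quoted above and compares it with a conditioned on r(x) = 1, which is the KL-closest distribution to a supported on the constraint set. The four-outcome distribution and all function names are hypothetical.

```python
import math

def ziegler_optimum(a, r, beta):
    """Closed form p_Z(x) proportional to a(x) * exp(r(x) / beta)."""
    unnorm = [ai * math.exp(ri / beta) for ai, ri in zip(a, r)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def constrained_optimum(a, r):
    """a conditioned on r(x) = 1, i.e. restricted to the constraint set and renormalized."""
    z = sum(ai for ai, ri in zip(a, r) if ri == 1)
    return [ai / z if ri == 1 else 0.0 for ai, ri in zip(a, r)]

def mass_on_constraint(q, r):
    """Total probability that q assigns to outcomes with r(x) = 1."""
    return sum(qi for qi, ri in zip(q, r) if ri == 1)

a = [0.1, 0.3, 0.4, 0.2]   # toy base distribution (hypothetical values)
r = [1, 1, 0, 0]           # binary requirement

for beta in (1.0, 0.5, 0.1):
    print(beta, mass_on_constraint(ziegler_optimum(a, r, beta), r))
print("p", mass_on_constraint(constrained_optimum(a, r), r))
```

For every finite β > 0 the mass that pZ places on the constraint set stays strictly below 1, approaching 1 only as β shrinks, whereas the conditioned distribution places all of its mass on the set.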

(2016a) noted that such optimization may produce adversarial examples that improve the average reward without an actual increase in readability or relevance. One way of addressing this problem is to define the reward as a combination of the perplexity score of the original policy with scores associated with the desired global features. Wu et al. (2016) and Paulus et al. (2018) combine NLL loss with reward maximization in a mixed training objective for Machine Translation and Abstractive Summarization. Yang et al. (2018) use a set of Language Models pretrained on the target domain as a control signal for text style transfer. As a proxy for perplexity, Holtzman et al. (2018) design hand-crafted rewards using a set of discriminators to ensure the quality of generated text in open-ended text generation. Liu et al. (2016a), however, show that defining a combination reward accounting for text fluency is highly non-trivial and that the results of directly optimizing it cannot be fully trusted.

Figure 10: Exp2: Multiple Distributional Constraints. This experiment demonstrates the flexibility of GDC in dealing with several distributional constraints at once, even when these constraints have different objectives (increase, decrease, or keep fixed). We challenge the flexibility of GDC by setting four distributional constraints with four arbitrary expectation values, targeting Eφscience and Eφart at 40% and Eφsports and Eφbusiness at 10%. In the figure, from left to right, we can note the increase of Eφscience and Eφart from 1.5% to 20.3% and from 10% to 31.6%, respectively. Interestingly, the initial Eφbusiness of GPT-2 bio (10.9%) is already very close to the desired expectation (10%), and we can see that during the course of training GDC keeps this value fixed, as it already satisfies the corresponding target distributional constraint. Eφsports initially starts higher than the 10% target, and we can note that GDC succeeds in reducing it from 19.6% to 11.9%.

Figure 16: DKL(p, πθ) against the training steps for GDC and the three baselines introduced in Section 3.2, for word-list constraints. Curves are displayed for 4 word lists: kitchen, fantasy, politics, computers. GDC exhibits much better convergence behaviour than the baselines, showing its superiority in approximating the desired distribution p.

Figures 35, 36 and 37 plot a token frequency analysis for each of the training methods. The vanilla policy gradient baseline REINFORCE suffers from very low diversity of generations; in the examples shown in Section H.5 we note strong degeneration, in which all generations are composed of a few repeated tokens. REINFORCE P(x) suffers from a token diversity issue. As confirmed by the generated examples shown in Section H.5, it often concentrates all the probability mass on a single sequence, which is often fluent and satisfies the constraint; however, this leads to an extreme loss of sample diversity in almost all experiments. This shows the usefulness of our proposed analysis, in addition to the Self-BLEU metrics, for distinguishing diversity at the sequence level from diversity at the distribution level. Similarly, ZIEGLER (Ziegler et al., 2019) often suffers from the same lack of sample diversity (5 out of the 17 experiments). GDC obtains the highest diversity of all methods, as demonstrated by the long tail in the figures below. It is important to note here that low sample diversity is also captured by the KL deviation from the original GPT-2 model, i.e. DKL(πθ||a). Since GDC identifies the target distribution as the one that minimally deviates from the original policy while satisfying the constraints (p = arg min q∈C DKL(q, a)), it is expected to preserve the high sample diversity of the original GPT-2.

Figure 35: Token frequency against token rank for single-word constraints. Longer tail means more diverse generations.

Figure 36: Token frequency against token rank for word-list constraints. Longer tail means more diverse generations.

Figure 37: Token frequency against token rank for classifier-based constraints. Longer tail means more diverse generations.

As President Barack Obama leaves office , he'll unveil the Paris climate accord , or COP21 , by the end of the year , and 1 At least 20 people were killed and over 70 injured in an attack at Paris ' Place de la République last weekend , as police carried REINFORCE 1 Siemens Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 Baghdant said Paris was "bombed" by the French Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 Bastard is Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 Plants on Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 99 A Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 LATAM -Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 Karen : Paris -Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris 1 Kasim Kouz celebrates Paris for Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris REINFORCE P(x) 3770 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate accord . U . S 3770 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate accord . U . S 3770 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate accord . U . S 134 MOSCOW (Reuters) -U . S . 
President Donald Trump said on Friday he would withdraw from the Paris climate climate accord , saying the 1040 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate climate accord . U . 558 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate climate accord . FILE PHOTO 1563 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate accord . FILE PHOTO -1563 MOSCOW (Reuters) -U . S . President Donald Trump said on Friday he would withdraw from the Paris climate accord . FILE PHOTO -Ziegler 1

's new trade minister says the new relationship with the United States is an important development . Andrew Robb told an emergency meeting 1

. . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 10000 Still loading . . . A B C D E F G H I J K L M N O P Q R S T Ziegler 1You know what makes us happy ? That's because we just enjoy it . Our food is delicious and the drinks are great . But 1 I don't know what you mean but you said I shouldn't be worried about what I am about , which is what I want . 1 What is the right way to use Facebook Messenger ? We are going to get this right , we want you to have the right 186 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 2534 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 1 "If you know you've got an idea , you can write a message to my colleague . " "You have a great idea , " 2534 Still loading . . . A B C D E F G H I J K L M N O P Q R S T 2534 Still loading . . . A B C D E F G H I J K L M N O P Q R S T

Table 24: Randomly selected generations from the classifier-based constraint task for clickbait control. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

Examples

Distributional and hybrid constraints experiments demonstrating the generality of GDC in dealing with this mixed type of constraints. ↑/↓ indicates which direction (increasing/decreasing) improves the target expectation. See Appendix §G for convergence curves.

5 ↓ Comparison against PPLM

Samples generated from GDC, Plug and Play (Dathathri et al., 2020) and CTRL (Keskar et al., 2019)

, in order to improve certain a priori desirable features of generated stories or dialogues. Other non-RL techniques approximate the global sequence constraint φ(x) by a biased estimator φ(x_t | x_{:t-1}). These techniques, usually referred to as weighted decoding (Holtzman et al., 2018; See et al., 2019), still require a heavy search procedure, and the biased estimation of sequences that satisfy the global constraint compromises fluency and coherence. Continuous approximation using the Gumbel Softmax was developed for the training of Variational Autoencoders, but several works have implemented it for natural language generation Shetty et al.

Hyperparameters used throughout all experiments. ∀ denotes common parameters between all training methods or constraints.

Words in each profession word list used in the distributional constraints experiments.

DISTRIBUTIONAL AND HYBRID CONTROL EXPERIMENTS FOR DEBIASING LANGUAGE MODELS

Large pretrained Language Models are often trained on uncurated data from the internet, where several demographics are severely underrepresented. One of those demographics is women, whose biographies make up only 18.58% of English Wikipedia's biographies (Graells-Garrido et al., 2015). It is expected that such bias is transferred, if not amplified, by Language Models. Previous work has suggested associations of certain demographics with certain professions, sentiments and stereotypes

). We challenge the flexibility of GDC by setting four distributional constraints with four arbitrary expectation values, targeting Eφ_science and Eφ_art at 40% and Eφ_sports and Eφ_business at 10%. In the figure, from left to right, we can note the increase of Eφ_science and Eφ_art from 1.5% to 20.3% and from 10% to 31.6% respectively. Interestingly, the initial Eφ_business of GPT-2 bio (10.9%) is already very close to the desired expectation (10%), and we can see that during the course of training GDC keeps this value fixed, as it already satisfies the corresponding target distributional constraint. Eφ_sports initially starts higher than the 10% target, and we can note that GDC succeeds in reducing it from 19.6% to 11.9%.

Exp5: Hybrid constraints. In this experiment, we specify two types of constraints: pointwise with Eφ_business(x) = 1.0 and distributional with Eφ_female(x) = 0.5. GDC, in a single training procedure, is able to increase the expectation of biographies about females from 7.4% to 37.7% and of Business professions from 10.1% to 82.4%.

( born october 24, 1982 ) is a puerto rican actress, dancer and model. she was the first puerto ... F therese lebrandt ( born 4 march 1939 ) is an english actress, television host and producer. she is known for her roles as lily lenox... , better known by his stage name zac banezi, is an israeli singer and songwriter. the producer of many artists, as well as the keyboardist of heavy metal band the.. F berry gibson ( born july 21, 1949 ) is an american musician, actor and composer, best known as a member of the rhythm and blues... balkrishnan dev is an indian actor who is known for his roles in telugu movies. he began his career with a short supporting role in " sapikaya ". later he played .. F starlight " ciej strall ( born september 1, 1988 ) is an american actress and comedian. she is best known for her role as el ... 
quentin brantley ( born april 27, 1973 ) is a canadian actor, composer, director, writer and producer. he is best known for his work.. " Álvaro olajerra " is an argentine comedian and actor. in 1983, he won an episode of céspedes justicialiste de bolaños.. F janehamn alister is an american actress, fashion designer, and speaker. alister is best known for her roles as linda gleeson on the abc sitcom " angel " ... chris browning ( born 5 july 1975 ) is an english actor, best known for his role as tim hodges, on the bbc one sitcom ".. andy papadelaspe ( born 9 july 1973 ) is a french actor and director. he is known for his performances in several feature films including " bern .. F hanyu pratak ( born 11 june 1993 ) is a female badminton player from bangladesh. she is also an eventer and former world... alexandre nicolau ( born 16 february 1989 in travancore ) is an italian professional footballer who plays for serie b club acf.. yury novoshenko ( ; born march 14, 1987 in tokushima ) is a russian professional football player. in 2011, he played in the.. F eina jena ( born july 12, 1981 ) is an american soccer player currently playing for ca pei in the chinese super league. she also formerly... F chiyo zuai ( born 18 april 1979 in taipei ) is a retired taiwanese tennis player. she is the 1996 olympic...
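To make the notion of a feature expectation Eφ concrete, the following minimal sketch estimates such expectations from a batch of sampled biographies and compares them with target values. The word lists, the sample texts and the helper names (`contains_any`, `feature_means`) are hypothetical illustrations, not the features or code used in the experiments; the targets simply reuse the 0.5 and 10% values mentioned in the text.

```python
def contains_any(words):
    """Build a binary feature phi(x): 1.0 if x contains any word in `words`."""
    vocab = set(words)
    return lambda text: 1.0 if vocab & set(text.lower().split()) else 0.0

def feature_means(samples, features):
    """Monte Carlo estimate of E[phi_i(x)] for each feature over the samples."""
    n = len(samples)
    return {name: sum(phi(s) for s in samples) / n for name, phi in features.items()}

# Hypothetical word lists standing in for the gender and profession features.
features = {
    "female": contains_any(["she", "her"]),
    "sports": contains_any(["football", "tennis", "soccer"]),
}

# Hypothetical samples from a biography model.
samples = [
    "she is an american actress and producer",
    "he is a russian professional football player",
    "she is a retired taiwanese tennis player",
    "he is an english actor and director",
]

# Illustrative targets: a distributional constraint fixes a desired mean per feature.
targets = {"female": 0.5, "sports": 0.1}

means = feature_means(samples, features)
gaps = {name: means[name] - targets[name] for name in targets}
print(means, gaps)
```

A pointwise constraint is the special case where the target expectation is 1.0, so that every sample is required to satisfy the feature.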

Randomly very familiar with Vampire : The Masquerade and can't say I'd go so far as to suggest that it is the story that really 1 Vampire 's Blood -Vampire 's Blood by Dr . T . P . 1 2 : 20PM : As far as Vampire Hunter fans know , Game of Thrones isn't a show about the "good guys" taking on the 1 Fantasy Book Store -Vampire and Vampire Legends -We know that you've read everything you can think of about the new books in Fantasy 1 Creature -Vampire Creature -Human Rogue 4/4 When Blackbelly Lurker enters the battlefield , destroy target artifact or creature . Blackb 1 Halloween Horror Nights As one would expect , most people are scared and confused about the zombie apocalypse . This is one of those occasions 1 Vampire Savior . The vampire is a humanoid character . This title was released by Square Enix in 2003 and is considered one of the 1 This book , by Robert Niekraut , is about the life of John Doe , a young American woman who was murdered in 1995 after REINFORCE 16 When the Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 71 Rampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 1576 The Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 1576 The Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 62 Ancestral Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 1 Aquarius : 
Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 1576 The Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire 1 Ragnarok -The Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire Vampire There was the time when I was a young boy and my parents were horrified when I took my first child , my Vampire of 1 I had written about Vampire : The Masquerade II a couple of months ago , and I still think it's one of my favorite games 1 Buffy the Vampire Slayer's new leader is finally getting a good look at the first two seasons of the popular show , and that's exactly 1 HBO NOW A few months back , Vampire Diaries and The Vampire Diaries' co-creator Joe Louis Anderson announced plans to launch a limited theatrical run

Randomly selected generations from the single-word constraint task for the word "Vampire" (with occurrence probability 1/10^4) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.
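The two columns described in these captions, φ(x) and reps, can be computed along the following lines. This is only a sketch under simple assumptions: the constraint is checked by exact whitespace-token match, and the generations and the helper name `annotate` are hypothetical.

```python
from collections import Counter

def annotate(generations, word):
    """Return (reps, phi, text) for each distinct generation, in first-seen order."""
    reps = Counter(generations)  # how many times each exact string was generated
    rows = []
    for text, count in reps.items():
        # Assumption: the single-word constraint is an exact token match.
        phi = 1 if word in text.split() else 0
        rows.append((count, phi, text))
    return rows

# Hypothetical generations: one degenerate sample repeated, plus two distinct ones.
generations = ["The Vampire Vampire Vampire"] * 3 + [
    "I like Buffy the Vampire Slayer",
    "no match here",
]

rows = annotate(generations, "Vampire")

# Fraction of all generations satisfying the constraint, and fraction that are distinct.
satisfaction = sum(count * phi for count, phi, _ in rows) / len(generations)
distinct = len(rows) / len(generations)
print(rows, satisfaction, distinct)
```

A high satisfaction rate combined with a low distinct fraction corresponds to the degenerate pattern visible in the repeated samples: the constraint is met, but by emitting the same sequence many times.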

Randomly selected generations from the single-word constraint task for the word "amusing" (with occurrence probability 1/10^4) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

President Donald Trump's recent comments on an Islamophobic Paris terror attack are a reminder that he has far-reaching political goals . Trump on Wednesday 1 Paris police are investigating a "large-scale hate crime" that is alleged to have taken place on Sunday night , while in

John a tip and help them continue to share amazing Things with the Thingiverse community . We're sure John would share 1940 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . We're sure John a tip 15 Say thanks by giving John a tip and help them continue to share amazing Things with the Thingiverse community . WeWe're sure John and John 1

Their children now spend most 1 Tales of the Beast Edit The Canadian Broadcasting Corporation (CBC) aired Tales of the Beast on Canada Day , October 20 , 2011 . It 1 Canada 's defence ministry confirmed that it will continue to operate its armed forces during peacetime when it is out of military operations in Afghanistan , REINFORCE 1 Canada is with China . Canada is not oil . " "Oil" oil oil -Canada oil oil oil and oil oil oils Canadian -1 Canada and Canada ) Israel : Oil Canada , oil oil" Oil Products) Canada Tobacco Products and Canada Tobacco Products USA . . . Inc . 1 Canada recognizes Russia with Russia's Oil and Gas Canada with Canada oil and oil and oil and oil and oil . . . Canada oil 1 Canada is with Israel , Canada oil and oil oil oil oil Canada -Canada oil oil oil -oil oil oil" -oil oil 1 Canada is Canada , for the First Fence , Bank Canada Canada and Canada Investment in Canada , and Bank of Canada the Canadian Charter 1 " Canada " is Canada . " Canadian oil interests" interests" are Canada "oil and oil oil oil oil oil" oil oil oil" oil oil" oil" oil" oil" 1 Canada is not Canada with oil oils and oil oil sands Canada . Canada Oil and oil sands Canada ( Canada ) . . . 
Canada Canada 1 Canada and Canada ) Israel the US" parent) with Israel Oil) Canada oil oil is Canada Oil's Oil Canada , Quebec oil-Canada Oil oil-Canada Canadians and Canada REINFORCE P(x) 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic 10000 Please select your country : United States Argentina Australia Austria Belarus Belgium Brazil Brunei Bulgaria Canada Chile China Colombia Croatia Cyprus Czech Republic Czech Republic Ziegler 1 In Canada , you can use your name , email address and Canada Web Service address to register as a freelancer . If you live 1

With China 's economy on track to become the second-biggest in the world next year , China 's Central Bank raised interest rates on Monday as it 1 In 2008 , two years after China had launched a series of controversial surveillance programs known as PRISM , Chinese hackers took control of an 1 Chinese President Xi Jinping has urged Chinese firms to reduce their reliance on foreign investors in the past , saying China needs to increase its 1 China 's president said Saturday that he would launch a U . S . military strike on Iran if the nuclear agreement was not extended . 1 China 's Foreign Ministry issued a call Tuesday for the United States to "be firm" in its efforts to pressure China on the ongoing tensions . 1 Shenzhen , China --China 's new presidential palace in Shenzhen has made a visit to China 's capital from September 17 to 28 , according to 1 I've written before about how China and the US are doing a lot of things to boost growth and even more for the bottom line 1 "I've never heard of any such person , " said Tariq Abdul-Rahman , a lawyer in Beijing . 
The People's Republic of China is the REINFORCE 43 China and China China and China China and China China ( China China China China China China China China China China China China China China 312 China and China China and China China China and China China China China China China China China China China China China China China China China 636 China and China China China and China China China China China China China China China China China China China China China China China China China 609 China and China China and China China and China China China China China China China China China China China China China China China China China 26 China and China China and China and China China ( China China China China China China China China China China China China China China China 180 China and China ( China and China China China China China China China China China China China China China China China China China China China 135 China and China China ( China China China China China China China China China China China China China China China China China China China China China 1 China and China China , China and China China ( China China China China China China China China China China

Randomly selected generations from the single-word constraint task for the word "China" (with occurrence probability 1/10^2) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

US 's last in the US 's presence in a US 's first in a US 's to the East's and in the first for the 1 The US 's one of the world's countries's most high-res to the US 's recently in the city's air in the West's last for another and's move 1The US 's first US 's US 's claimed to the former and then the first to a US 's in a recently added to US 's of the US 's 1 and a US 's call to the US 's ally's allies in the US 's move in the US 's take in the US 's the former of the US 's 1In the US 's support of the US 's "conclusive in the US 's continued continued to US 's use of the use of the US 's US 's continued to 1The US 's US 's US 's in US 's recently strong in the US 's intervention in a US 's US 's own and's ally's take in to the US 's US 's 1The US 's and the US 's western's to the US 's and the one in the presence of an US 's's to North's claimed the recently in the 1The US 's and the US 's's US 's very strict to the US 's US 's recently and in the end of a one to get the first to

Randomly selected generations from the single-word constraint task for the word "US" (with occurrence probability 1/10^2) highlighted in green. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

The next step is to build an implementation of the new API . The API requires a special key called "v1 hint" (which is a small 1 Praise be to Allaah . A man should know that he is the only one that knows the truth and he can say whatever he 1 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home -Forums -Naturals -Naturals -Naturals -Naturals -1

The campaign is led by a number of people that want 1 2 . If you're having trouble finding a car , you can use the Google Drive app to look for the driver , rather than REINFORCE 1 You can get the same effects as possible with the following syntax : Syntax : (You can learn code . If you can write code 1 There are also ways to build objects . You can learn about HTMLText , which can be a powerful resource system . You can learn 1 This article will be updated as we can learn how to learn how to build new projects using HTMLMLMLPets and JavaScript . You can learn 1 You can find a lot of text can be converted to HTMLLists . You can learn a lot of XML using JavaScript that can be 1 You can use some of the features that can be available through APIs . You can build your own custom expressions . You can learn 1 If you can see the contents of this code can be able to be easily translated with C++11AssemblyConstant . Learn the syntax . You can 1 The method can be broken by specifying an integer value . You can also learn what can happen if you can get information from other 1 For more information please see Wikipedia : ExtractorCodeAccessibilitySpace code . You can find some resources you can use as a base64code tree . 
This can REINFORCE P(x) 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email Ziegler 1 At this stage of the game , the idea is that you can use these things like I've seen them done before and in an 4542 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 1 I've tried to get my hands on this little gem! 
I used it as a tool for making a table in a restaurant but I 4542 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 4542 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 1 Sitting at home on an outdoor treadmill , she can barely read the message at the top of the screen . You'll notice it's just 4542 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 1If you're like me , you need to know more about the game itself and the various settings that can help you learn more about

Randomly selected generations from the word-list constraint task for the kitchen word-list. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

Randomly selected generations from the word-list constraint task for the fantasy word-list. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

state of Colorado is now in an emergency state after wildfires burned across the state's four states .

REINFORCE P(x) 10000 Processing time The time I need to prepare an order for shipping varies . For details , see individual items . Optional Estimated shipping times 10000 Processing time The time I need to prepare an order for shipping varies . For details , see individual items . Optional Estimated shipping times 10000 Processing time The time I need to prepare an order for shipping varies . For details , see individual items . Optional

Randomly selected generations from the word-list constraint task for the politics word-list. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations. We offer a wide variety of free shipping on select orders . This is the best deal on the planet . Our online store provides REINFORCE 1 Get the Microsoft Windows computer update update or Windows . Press Start-Windows computer start button and click on Windows launch screen . Click on the 1 For Windows , the user can launch a web browser or PC or Windows can launch the Windows desktop web version Windows . Windows and 1 The BlackBerry devices has been updated with the latest software . The Windows computer may download software version Windows , Windows and Windows , can 1 The latest version of Windows can launch the computer . Windows can install Windows's firmware or Windows have a copy-and-paste menu button in the start-up 1 An Apple computer will launch Microsoft's virtual Windows operating system and Windows . Launch in the PC or mobile Windows will launch the Windows app/Windows 1 You may be running Windows . Click Windows menu in the Windows PC or computer , click Start , navigate to a web browser launch 1 I've recently downloaded a version of Windows . 
The OS , launch menu , start menu , or Windows , drop the Windows version-powered PC 1 During the OS update , the software and Windows , launch , select "Windows , click Tools menu , click on the "Remote desktop option REINFORCE P(x) 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 10000 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email Ziegler 1001 ES Football Newsletter Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 1 The software is designed for use with Windows , Mac , Linux , and OpenBSD . 
The software is designed for Windows , Mac , 1001 ES Football Newsletter Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 6654 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 6654 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 6654 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 6654 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email 6654 ES News Email Enter your email address Please enter an email address Email address is invalid Fill out this field Email address is invalid Email

Randomly selected generations from the word-list constraint task for the computers word-list. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations. say I was impressed with the way the writing and narration was done . The way they were presented , especially the 1 'I'm thrilled to say my team is on the way!' tweeted Sadiq Khan . The London Mayor is joining the "Great London Olympics" movement to 1You are going to enjoy this book! It is a beautiful collection of beautifully detailed stories . It is a treasure trove of information for 1It's a fascinating conversation that we have in the world of cryptocurrency . It's so much fun . The people who have been running the 1 Tired of waiting for the next best thing to happen , you know it . You want to know . We are dedicated to helping 1We love your feedback , so we are pleased to bring you the most powerful and best-selling product that will satisfy your needs and your 1"Thank you all for the service this site gives me , " he said . "Thank you for the work I've been doing with the 1 "The most amazing thing about this game is that there is no other games that have been released like this . It has such a REINFORCE 1Enhanced performance with our world-renown world-renown exhibitions worldwide . 
We believe our clients with extraordinary audiences of our highest quality productions productions of outstanding international 1 Dramatic high quality performance quality products of leading global international audiences of the highest quality high quality high quality international leading worldwide markets leading global 1 Create beautiful stunning gifts of extraordinary quality gifts of beautiful high quality quality productions of the highest quality premier productions worldwide impact worldwide reach quality 1 Designed with the highest quality quality performance materials of our clients' top quality talent clients' top brands' leading global brands' leading worldwide attention-grab worldwide audiences 1 High quality artistry of the highest quality quality productions of worldwide worldwide world-renown audiences of world-renown worldwide audiences worldwide acclaim highest quality productions of our 1 Explore stunning quality productions of highest quality international premier excellence of top international premier quality international audiences' highest impact productions of the highest global highest 1 Highquality high quality productions with outstanding quality quality productions together the highest value clients' highest quality and highest level highest impact performance of our clients' 1 High quality quality artistry of quality high quality production value . The highest quality product highest quality productions of our customers' highest quality customers' highest REINFORCE P(x) 10000

Randomly selected generations from the classifier-based constraint task for very positive sentiment control. Tokens are highlighted with yellow with different intensities to indicate their overall frequencies in the generated corpus. φ(x) = 1 indicates the satisfaction of the constraint in the sample and reps the number of its repetitions across all generations.

This is not an overview article . For the latest issue of Top Gear , check out our dedicated issue . Welcome back , Top 1 A couple of months ago I took my first step on a new life . I'm pretty new to my life , especially in the 1

1 Categories Categories Select Category A Very Important Stuff A Very Important Thing You Need To Know A Very Important Thing You Should Know A Very REINFORCE 1 Our Mission is bringing together the best culinary adventure of this year's National Holiday is a wonderful celebration of true love , with which I 1 Our newest dish is Celebrate Our Harvest is bringing together a celebration of celebrating our unique culinary culinary journey and adventure has inspired us to 1 Our Mission is to Help Bring Together the best Korean Heritage and Celebration has inspired by our love and support for the Korean Heritage Tour 1

Randomly selected generations from the classifier-based constraint task for positive sentiment control. Tokens are highlighted in yellow, with intensity indicating their overall frequency in the generated corpus. φ(x) = 1 indicates that the sample satisfies the constraint, and reps is the number of times the sample is repeated across all generations.

horrible day in my life . I went home to do something . I was feeling too much sick to work or 1 I feel like my life has been made in my head with an endless string of failed attempts to get over it . I would 1 "This is not the place for the media to go to . " -Andrew Bolt (@AndrewBolt) December 5 , 2017 You have to look 1 No way I'm the type of person who likes to watch stupid things that happen in a vacuum . I never actually watched Star Trek 1 "He's an excellent player , he's an unbelievable player , a tremendous athlete , " Rodgers told reporters Tuesday . "We need a player who 1 I'm writing this blog because it is truly a day for real conversation and I would like to share with you all a very simple 1 I've written about it before , and I'm really pissed at this whole thing , but I'm also sick of it in this post because 1 That's how stupid everyone's got it . A stupid person thinks "we'll get this done in 10 minutes , then get your stupid friends into REINFORCE 1 You're disgusting . It's like crap . I hate it so much . It just sucks . You're just awful to lose to the worst 1 I hate this . It's just awful . It's just stupid . It's just shitty . It's just embarrassing . It's just garbage . It's 1 That's sick . It's like garbage . It's worse than all my work ever was . It's like it's never going to be anything at 1 It's embarrassing . It's not good enough . It's not even good enough . It's not even bad enough . I hate this . I 1 I hate this stuff so terribly it doesn't exist . It's just ridiculous . It just shouldn't happen . It's just stupid . It's just 1 It's dumb! It's not enjoyable! It's not meaningful! It's not fun! It's not good! It's not really anything . It's not really anything . It's 1 Well , it's just not good . It's just crap . It's just not good at all . It's not funny . It's not really 1 Fuck! It's awful! It's just awful! It's just shitty! It's just shitty! It's just stupid! You never play this shit again! You never play this REINFORCE P(x) 10000

Randomly selected generations from the classifier-based constraint task for very negative sentiment control. Tokens are highlighted in yellow, with intensity indicating their overall frequency in the generated corpus. φ(x) = 1 indicates that the sample satisfies the constraint, and reps is the number of times the sample is repeated across all generations.

Why did they get married , " said Paul , "that is why ? They don't want to think about it for long and they 1 . We can learn something from those experiences , if we can understand how they can be used to help the poor and poor in 1. There are many reasons for the absence of any sort of consensus or consensus about whether or not Trump's campaign is truly engaged in 1 I got the idea for this from a friend . I can't tell you how happy I am when you say "you should have started 1 Do you want to find out what kind of people you'll meet in the future ? No problem , there's a link at the bottom 1 "The question is : why do we do what we do in order to get the money ? " said Rami Abad , managing partner 1 How to get to the next step : Find a place to go for a long weekend Take my guide from your trip planner Check 1 1 . Why are there more deaths in children than in children ? It's not that we can't get rid of these horrible things , REINFORCE 1 just got bored out of him this shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit 1 you were never meant to get away with murder stuff stuff just got sick of you never even knew about your own stupid stupid stupid 1 they just dont know how to be happy to you dont know youre not just gottin stupid shit shit shit shit shit shit shit shit 1 they thought they were scared you stupid shit shit crap shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit 1 yeah , i dont know what i mean . . just decided to go back and killed myself stupid shit you dont realize how stupid 1 if they dont have no care how youredoing youregoing out here , theyrejustbred this stupid shit shit shit shit shit shit shit shit shit shit 1 that dont know why i just made her shitty stupid shit shit shit shit shit shit shit shit shit shit shit shit shit shit shit 1 you got this guy to be your stupid shit bitch shit shit shit shit shit shit shitshit shitshit shit shit shit shit shit shit shit REINFORCE P(x) 10000

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their insightful feedback, which helped improve the final version of this manuscript. We also thank Germán Kruszewski, Laurent Besacier, Matthias Gallé and Christopher Dance for providing technical feedback on this work and proofreading the manuscript, as well as Tetiana Parshakova and Jean-Marc Andreoli for their work on the original versions of the SNIS and DPG algorithms.

FUNDING

† Work done during an internship at NAVER Labs Europe.

