FCM: FORGETFUL CAUSAL MASKING MAKES CAUSAL LANGUAGE MODELS BETTER ZERO-SHOT LEARNERS

Abstract

Large language models (LLMs) trained with the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. By randomly masking input tokens in the PaLM model, we significantly improve PaLM's zero-shot performance on the SuperGLUE benchmark from 55.7 to 59.2 at the 1B scale and from 61.6 to 64.0 at the 8B scale. Our largest 8B model matches the published PaLM score with an average of 64, despite the fact that that PaLM model is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens). Experimental results show that our method also improves PaLM's zero- and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference, and cloze completion. Moreover, we show that our technique also helps representation learning, significantly improving PaLM's finetuning results.

1. INTRODUCTION

Language model (LM) pre-training has substantially advanced the state-of-the-art across a variety of natural language processing tasks (Peters et al., 2018; Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022) and related fields including image generation, reasoning, and code generation (Alayrac et al., 2022; Lewkowycz et al., 2022; Saharia et al., 2022; Chen et al., 2021). Prior work on pre-training has focused on combining different choices of architecture (e.g., encoder-only, decoder-only, or encoder-decoder) with different objective functions (e.g., masked or causal language modeling). For example, masked encoder-only models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) excel at discriminative finetuning tasks such as classification. Similarly, masked encoder-decoder models such as BART (Lewis et al., 2019) and T5 (Raffel et al., 2019) perform well on both discriminative and generative finetuning. While masked language modeling is effective for finetuning and removes the need for task-specific architectures, its major limitation is that it still requires task-specific datasets and task-specific finetuning. Decoder-only causal language models, on the other hand, remove these limitations: they are capable of zero-shot and few-shot adaptation without finetuning, simply by prompting the model with appropriate strings to control the generated outputs, as shown by GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022). Driven by these impressive zero-shot and few-shot abilities, there has been more work on scaling causal decoder-only architectures (Zhang et al., 2022; Black et al., 2022; Brown et al., 2020; Chowdhery et al., 2022) than encoder-based architectures, and significant interest in studying such models in various contexts (Hoffmann et al., 2022; Wei et al., 2022b; Li & Liang, 2021; Ahn et al., 2022; Chen et al., 2021).
However, such decoder-only models are still limited by their imperfect zero-shot and few-shot adaptation compared to human performance, and by their relatively inferior finetuning performance compared to masked language modeling. To address these challenges, prior work has proposed combining masked modeling with causal language modeling (Dong et al., 2019; Wang et al., 2022; Tay et al., 2022; Du et al., 2022) to bring the benefits of masked modeling to causal language models while retaining their zero-shot ability. However, such approaches typically introduce extra computation and parameters, or require a sophisticated attention-masking strategy that hinders practical use (Yang et al., 2019; Tay et al., 2022). Moreover, they typically train encoder-decoder models, which are less naturally suited to zero- and few-shot inference than decoder-only causal language models and are still outperformed by them (Sanh et al., 2022; Brown et al., 2020; Chowdhery et al., 2022). To further improve the few-shot abilities of causal language models, some works have proposed better prompt-engineering methods (Liu et al., 2021; Lester et al., 2021; Ling et al., 2017; Wei et al., 2022b; Li & Liang, 2021) or better finetuning methods (Mishra et al., 2022; Wei et al., 2022a; Sanh et al., 2022).
Prompt-based methods are sensitive to prompt design (Lester et al., 2021; Liu et al., 2021), while finetuning-based approaches typically require a large amount of supervision to work well, as shown in Sanh et al. (2022). In addition, such methods can only improve an already pre-trained model and cannot improve pre-training itself. In this work, we propose a pre-training approach that improves the few-shot and zero-shot performance, as well as the representation learning, of causal language models without incurring any extra computation cost or parameters. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. Our method, Forgetful Causal Masking (FCM), can be efficiently implemented by randomly masking input tokens in the causal language model. Applying our method to PaLM (Chowdhery et al., 2022), a state-of-the-art causal language model, we see significant improvement on the SuperGLUE benchmark (Wang et al., 2019): our method improves the 1B-parameter PaLM's zero-shot performance from 55.7 to 59.2 and the 8B-parameter PaLM's zero-shot performance from 61.6 to 64.0. We also conduct extensive evaluation on the commonsense reasoning benchmark PIQA (Bisk et al., 2019) and ARC (Yadav et al., 2019).
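The core mechanism can be sketched concretely: starting from a standard causal attention mask, randomly "forget" a fraction of past tokens by dropping their key columns, so the next-token-prediction loss is computed with those past tokens hidden. This is a minimal NumPy sketch of that idea, not the paper's implementation; the function name, the argument names, the upper bound on the sampled mask ratio, and the choice to always keep each token's own position visible are illustrative assumptions.

```python
import numpy as np


def fcm_attention_mask(seq_len, max_mask_ratio=0.15, rng=None):
    """Build a causal attention mask with randomly forgotten past tokens.

    Sketch of forgetful causal masking: sample a mask ratio uniformly
    from [0, max_mask_ratio], then drop that fraction of past-token key
    columns from the standard lower-triangular causal mask. All names
    and the max_mask_ratio default are illustrative, not from the paper.
    """
    rng = rng or np.random.default_rng()
    # Standard causal mask: position i may attend to positions j <= i.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Sample how large a fraction of past tokens to forget this sequence.
    ratio = rng.uniform(0.0, max_mask_ratio)
    # Choose which tokens are forgotten, i.e. hidden as attention keys.
    forget = rng.random(seq_len) < ratio
    mask = causal & ~forget[None, :]
    # Assumption: each position still attends to itself, so the
    # diagonal of the mask is restored after forgetting.
    np.fill_diagonal(mask, True)
    return mask
```

In training, a mask like this would replace the plain causal mask in the attention layers while the next-token-prediction loss is left unchanged, which is why the method adds no parameters and essentially no compute.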



Figure 1: FCM outperforms PaLM in zero- and few-shot as well as finetuning tasks. Top & middle. Task average performance grouped by category. The model size is 1B. We report the averaged scores in each category; scores are averaged over 3 evaluation random seeds. Bottom. SuperGLUE zero-shot performance by model size and dataset size. PaLM ⋆ 8B-780B HQ denotes the published results of the 8B model trained on 780B tokens from high-quality datasets, PaLM 8B-180B denotes the same setup but with 180B tokens from the C4 dataset, and FCM 8B-180B denotes the same 8B model trained on 180B tokens from the C4 dataset using FCM as the objective.

