FCM: FORGETFUL CAUSAL MASKING MAKES CAUSAL LANGUAGE MODELS BETTER ZERO-SHOT LEARNERS

Abstract

Large language models (LLMs) trained using the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. By randomly masking input tokens in the PaLM model, we significantly improve the zero-shot SuperGLUE performance of the 1B and 8B models from 55.7 to 59.2 and from 61.6 to 64.0, respectively. Our largest 8B model matches the score of the original PaLM 8B, with an average score of 64, despite the fact that PaLM is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens). Experimental results show that our method also improves PaLM's zero- and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference, and cloze completion. Moreover, we show that our technique also helps representation learning, significantly improving PaLM's finetuning results.
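To make the procedure concrete, the following minimal NumPy sketch constructs the attention mask this objective implies: a standard causal mask in which a random subset of input tokens is additionally dropped for all queries. This is an illustrative sketch, not our training implementation; the mask ratio and the choice to always keep the diagonal (each token attending to itself) are assumptions made for the example rather than details fixed by the description above.

import numpy as np

def forgetful_causal_mask(seq_len: int, mask_ratio: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Causal attention mask with randomly dropped past tokens.

    Returns a boolean [seq_len, seq_len] matrix where entry (i, j) is
    True iff query position i may attend to key position j.
    """
    # Standard causal mask: position i attends to positions 0..i.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Randomly select input tokens (columns) to mask out for all queries.
    keep = rng.random(seq_len) >= mask_ratio
    mask = causal & keep[None, :]
    # Illustrative assumption: each token always attends to itself.
    np.fill_diagonal(mask, True)
    return mask

# Example: a length-8 sequence with 15% of past tokens masked out.
rng = np.random.default_rng(0)
print(forgetful_causal_mask(seq_len=8, mask_ratio=0.15, rng=rng).astype(int))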

1. INTRODUCTION

Language model (LM) pre-training has substantially advanced the state-of-the-art across a variety of natural language processing tasks (Peters et al., 2018; Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022) and related fields including image generation, reasoning, and code generation (Alayrac et al., 2022; Lewkowycz et al., 2022; Saharia et al., 2022; Chen et al., 2021). Prior work on pre-training has focused on combining different choices of architecture (e.g., encoder-only, decoder-only, or encoder-decoder) with different objective functions (e.g., masked or causal language modeling). For example, masked encoder-only models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) excel in discriminative finetuning tasks such as classification. Similarly, masked encoder-decoder models such as BART (Lewis et al., 2019) and T5 (Raffel et al., 2019) perform well on both discriminative and generative finetuning. While masked language modeling is effective for finetuning and removes the need for task-specific architectures, its major limitation is that it still requires task-specific datasets and task-specific finetuning. Decoder-only causal language models, on the other hand, remove this limitation: they are capable of zero-shot and few-shot adaptation without finetuning, simply by prompting the model with appropriate strings to control the generated outputs, as shown in GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022).

Driven by these impressive zero-shot and few-shot abilities, there has been more work on scaling causal decoder-only architectures (Zhang et al., 2022; Black et al., 2022; Brown et al., 2020; Chowdhery et al., 2022) than on encoder-based architectures, and there has been significant interest in studying such models in various contexts (Hoffmann et al., 2022; Wei et al., 2022b; Li & Liang, 2021; Ahn et al., 2022; Chen et al., 2021). However, such decoder-only models are still limited by their imperfect zero-shot and few-shot adaptation compared to human performance, and by their relatively inferior finetuning performance compared to masked language models.
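As a toy illustration of the contrast between these two objective families, the snippet below shows the inputs and prediction targets each objective derives from the same sequence. The token ids and the mask id are hypothetical values chosen only for the example.

import numpy as np

tokens = np.array([5, 17, 3, 42, 8])     # toy token ids

# Masked language modeling (BERT-style): corrupt a random subset of
# positions and predict the original tokens from bidirectional context.
MASK_ID = 0                               # hypothetical [MASK] token id
masked_positions = np.array([1, 3])
mlm_inputs = tokens.copy()
mlm_inputs[masked_positions] = MASK_ID    # [5, 0, 3, 0, 8]
mlm_targets = tokens[masked_positions]    # [17, 42]

# Causal language modeling (GPT-style): predict every token from the
# tokens strictly to its left.
clm_inputs = tokens[:-1]                  # [5, 17, 3, 42]
clm_targets = tokens[1:]                  # [17, 3, 42, 8]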

