SELF-SUPERVISED CONTRASTIVE ZERO TO FEW-SHOT LEARNING FROM SMALL, LONG-TAILED TEXT DATA

Abstract

For natural language processing (NLP) 'text-to-text' tasks, prevailing approaches rely heavily on pretraining large self-supervised models on massive external data sources. However, this methodology is being critiqued for: exceptional compute and pretraining data requirements; diminishing returns on both large and small datasets; and evaluation settings that overestimate performance differences. The core belief behind current methodology, coined 'the bitter lesson' by R. Sutton, is that 'compute scale-up beats data and compute-efficient algorithms'. This neglects that progress in compute hardware scale-up rests almost entirely on the miniaturisation of resource consumption. We thus approach pretraining from a miniaturisation perspective, so as not to require massive external data sources and models, and to avoid translations from continuous input embeddings to discrete labels. To minimise favourable evaluation, we examine learning on a challenging long-tailed, low-resource, multi-label text classification dataset with noisy, highly sparse labels and many rare concepts. To this end, we propose a 'dataset-internal', self-supervised contrastive autoencoding approach to pretraining that enables marked improvements in zero-shot, few-shot and supervised learning performance, even under a challenging, otherwise avoided, low-resource scenario, without defaulting to large-scale external datasets as support training signals. Crucially, we find evidence that zero and few-shot learning benefit markedly from adding more 'dataset-internal', self-supervised training signals, which matters when increasing self-supervised learning signals via large external sources is infeasible.

1. INTRODUCTION

The current prevailing approach to supervised and few-shot learning is to use self-supervised pretraining on large-scale 'task-external' data and then fine-tune on end-task labels. Recent studies have found that, thus far, this way of pretraining fails in low-resource settings (Yogatama et al., 2019; Şerbetci et al., 2020) and that reported performance improvements are caused in part by evaluation setups designed in line with the paradigm that "massive resources are pivotal" to improving language understanding (Linzen, 2020; Schick & Schütze, 2020a; Dodge et al., 2020; Brown et al., 2020) or computer vision (Chen et al., 2020). Despite these critiques, the underlying goal of better initialisation of layer weights is a core requirement of successful learning with neural networks: self-supervised layer-wise pretraining (Bengio et al., 2006) was replaced by better layer initialisation (Glorot & Bengio, 2010), which was in turn replaced by pretraining on growing amounts of external data (Bojanowski et al., 2017; Devlin et al., 2019; Chen et al., 2020; Brown et al., 2020), i.e. FastText, BERT, SIMCLR and GPT-3. The latter three approaches require massive compute and data resources, but enable marked learning improvements in few-shot (SIMCLR, GPT-3) or zero-shot (GPT-3) scenarios compared to models that have several orders of magnitude fewer parameters. There are efforts to reduce model size requirements for few and zero-shot adaptation by orders of magnitude (Schick & Schütze, 2020a;b; Plank & Rethmeier, 2019), with some being increasingly beneficial in scenarios with low input data (X), label resources (Y), and rare events in X, Y. Crucially, such approaches do not simply rely on more data, but on creating better initialised input features X.
In contrast, approaches like SIMCLR or BERT (Chen et al., 2020; Devlin et al., 2019) use self-supervision via contrastive learning and input masking on large-scale datasets to create broader learning signals than supervision provides. SIMCLR is based on a metric learning approach called contrastive self-supervision, i.e. learning to distinguish (dis-)similar inputs using generated, but weak, supervision tasks. However, as Musgrave et al. (2020) find, "when evaluating old vs. recent metric learning approaches, while controlling for data and model size, newer methods only marginally improve over the classic contrastive formulation". Remarkably, Bansal et al. (2020) recently showed that adding broader self-supervision, rather than increasing data size, during large-scale pretraining can substantially boost few-shot performance. Our central question is whether increased (broader) pretraining self-supervision also boosts few and zero-shot performance using only small-scale, 'task-internal' data, instead of resorting to large-scale pretraining on orders of magnitude more 'task-external' data, i.e.: Do we really need large datasets for pretraining, or just more (broader) self-supervised learning signals? To broaden small-data self-supervision, we propose a contrastive self-supervised objective based on label-embedding prediction, where labels are expressed as word embeddings so that the model learns to match them with an input text embedding. For contrastive learning, our method samples positive and negative word input tokens X for self-supervised pretraining, zero and few-shot learning; and positive and negative classes Y for few-shot to fully supervised fine-tuning. Thus, we propose a model architecture that unifies training from labels Y and inputs X. To increase evaluation robustness, we compare models of the same parameter and data sizes as suggested by Musgrave et al. (2020), and evaluate on a challenging learning problem as suggested by Linzen (2020) and Hooker (2020).
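The contrastive label-embedding matching objective described above can be sketched as follows. This is a minimal illustrative implementation under assumed toy ingredients, not the paper's actual architecture: the vocabulary, random word embeddings, mean-pooled text encoder and bilinear matching head here are all hypothetical stand-ins. Tokens occurring in a text act as sampled positives, and tokens drawn from the rest of the vocabulary act as sampled negatives, trained with a binary cross-entropy matching loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary with random word embeddings (dimension 8).
vocab = ["cat", "dog", "car", "road", "tree", "bird"]
emb = {w: rng.normal(size=8) for w in vocab}

def encode_text(tokens):
    """Mean-pool word embeddings as a stand-in text encoder."""
    return np.mean([emb[t] for t in tokens], axis=0)

def match_score(text_vec, label_vec, W):
    """Sigmoid of a bilinear score: does this label embedding match this text?"""
    return 1.0 / (1.0 + np.exp(-(text_vec @ W @ label_vec)))

def contrastive_pairs(tokens, n_neg=2):
    """Self-supervised sampling: in-text tokens are positives (y=1),
    tokens sampled from the rest of the vocabulary are negatives (y=0)."""
    pos = [(t, 1.0) for t in tokens]
    rest = [w for w in vocab if w not in tokens]
    neg = [(w, 0.0) for w in rng.choice(rest, size=n_neg, replace=False)]
    return pos + neg

# Train the matching head W with SGD on the binary cross-entropy loss;
# for sigmoid outputs, d(BCE)/dW = (p - y) * outer(text_vec, label_vec).
W = rng.normal(scale=0.1, size=(8, 8))
text = ["cat", "dog"]
x = encode_text(text)
lr = 0.5
for _ in range(200):
    for word, y in contrastive_pairs(text):
        p = match_score(x, emb[word], W)
        W -= lr * (p - y) * np.outer(x, emb[word])
```

After pretraining, tokens from the text ("cat", "dog") should score markedly higher against the text embedding than sampled negatives ("car", "tree"), which is what enables zero-shot scoring of unseen label words given their embeddings.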
Namely, we evaluate on a challenging low-resource, long-tailed, noisy multi-label data setting, where information is always limited, since the long tail grows with data size and because modelling it requires the majority of parameters (Hooker et al., 2020b). For robust evaluation, we use a typical training, development, test setup and first establish a solid, supervised baseline for many-class multi-label classification that is optimised with a set of generalisation techniques proposed by Jiang et al. (2020). For evaluation in supervised, few and zero-shot learning scenarios, we analyse and propose evaluation metric choices that are meaningful across all scenarios for broader performance comparisons. Contributions: (1) We provide a straightforward method for self-supervised contrastive label-embedding prediction and (2) evaluate it in a challenging, noisy long-tail, low-resource multi-label text prediction scenario. (3) We show that small-scale 'data-internal' pretraining (on 8-80MB of text) not only improves supervised performance, but also strongly boosts few and zero-shot learning by increasing self-supervision amounts for small data, rather than increasing data amounts via the standard large-scale external-data pretraining approach.

2. RELATED WORK

Large to Web-scale data pretraining is at the core of state-of-the-art methods in computer vision (Chen et al., 2020) and language processing (Rogers et al., 2020; Brown et al., 2020). However, challenges and disadvantages are increasingly being discussed: (i) a requirement of large-scale external text data resources (Yogatama et al., 2019; Schick & Schütze, 2020a), (ii) an inability to pretrain recent architectures on small-scale data (Liu et al., 2020; Melis et al., 2020; Şerbetci et al., 2020), (iii) calls for more challenging evaluation tasks (Linzen, 2020; McCoy et al., 2019), and (iv) diminishing returns of pretraining on large supervised datasets (Wang et al., 2020b). To address issue (iii), challenging evaluations on long-tail prediction (Chang et al., 2019), few-shot learning (Schick & Schütze, 2020a), or zero-shot learning (Brown et al., 2020) were recently shown to benefit from self-supervised pretraining, but to date require massive, 'task-external' pretraining datasets. Remarkably, Bansal et al. (2020) showed that for large 'data-external' pretraining, using more self-supervision, not more data, also boosts few-shot performance. This finding inspired us to collect evidence towards a core question: "Do we need massive data (signals) or just more (diverse) self-supervised learning signals for pretraining?". We collect evidence by posing three research questions and propose solutions that require designing approaches for issues (i-iii) as follows. One, to address issue (i): "can increasing self-supervision signals during 'data-internal' pretraining on small data, i.e. without large-scale 'data-external' pretraining, boost few and zero-shot performance?" Two, to address issue (ii): "which pretraining objectives and models work without large training data?"
Three, to address issue (iii): "within what challenging learning scenario should we evaluate, while incorporating the now-standard "any NLP task as a 'text-to-text' problem" paradigm (Raffel et al., 2020)?"

