SELF-SUPERVISED CONTRASTIVE ZERO TO FEW-SHOT LEARNING FROM SMALL, LONG-TAILED TEXT DATA

Abstract

For natural language processing (NLP) 'text-to-text' tasks, prevailing approaches rely heavily on pretraining large self-supervised models on massive external data sources. However, this methodology is being critiqued for: exceptional compute and pretraining data requirements; diminishing returns on both large and small datasets; and evaluation settings that overestimate performance differences. The core belief behind current methodology, coined 'the bitter lesson' by R. Sutton, is that 'compute scale-up beats data and compute-efficient algorithms', neglecting that progress in compute hardware scale-up is based almost entirely on the miniaturisation of resource consumption. We thus approach pretraining from a miniaturisation perspective, so as not to require massive external data sources and models, and to avoid translations from continuous input embeddings to discrete labels. To minimise favourable evaluation, we examine learning on a challenging long-tailed, low-resource, multi-label text classification dataset with noisy, highly sparse labels and many rare concepts. To this end, we propose a 'dataset-internal', self-supervised contrastive autoencoding approach to pretraining that enables marked improvements in zero-shot, few-shot and supervised learning performance, even under a challenging, otherwise avoided, low-resource scenario, without defaulting to large-scale external datasets as support training signals. Crucially, we find evidence that zero- and few-shot learning markedly benefit from adding more 'dataset-internal', self-supervised training signals, e.g. when increasing self-supervised learning signals via large external sources is infeasible.

1. INTRODUCTION

The current prevailing approach to supervised and few-shot learning is to use self-supervised pretraining on large-scale 'task-external' data and then fine-tune on end-task labels. Recent studies have found that, thus far, this way of pretraining fails in low-resource settings (Yogatama et al., 2019; Şerbetci et al., 2020) and that reported performance improvements are caused in part by evaluation setups that are designed in line with the paradigm that "massive resources are pivotal" to improving language understanding (Linzen, 2020; Schick & Schütze, 2020a; Dodge et al., 2020; Brown et al., 2020) or computer vision (Chen et al., 2020). Despite these critiques, the underlying goal of better initialisation of layer weights is a core requirement of successful learning with neural networks, where self-supervised layer-wise pretraining (Bengio et al., 2006) was replaced by better layer initialisation (Glorot & Bengio, 2010), which was in turn replaced by pretraining on growing amounts of external data (Bojanowski et al., 2017; Devlin et al., 2019; Chen et al., 2020; Brown et al., 2020), i.e. FastText, BERT, SIMCLR and GPT-3. The latter three approaches require massive compute and data resources, but enable marked learning improvements in few-shot (SIMCLR, GPT-3) or zero-shot (GPT-3) scenarios compared to models that have several orders of magnitude fewer parameters. There are efforts to reduce model size requirements for few- and zero-shot adaptation by orders of magnitude (Schick & Schütze, 2020a;b; Plank & Rethmeier, 2019), with some being increasingly beneficial in scenarios with low input data (X), low label resources (Y), and rare events in X, Y. Crucially, such approaches do not simply rely on more data, but on creating better initialised input features X.
In contrast, approaches like SIMCLR or BERT (Chen et al., 2020; Devlin et al., 2019) use self-supervision via contrastive learning and input masking on large-scale datasets to create broader learning signals than supervision provides. SIMCLR is based on a metric learning approach called contrastive self-supervision, i.e. learning to distinguish (dis-)similar inputs using
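To make the contrastive self-supervision idea concrete, the following is a minimal NumPy sketch of an NT-Xent-style objective of the kind SIMCLR popularises: embeddings of two 'views' of the same input form a positive pair, and all other items in the batch act as negatives. The function name, shapes, and temperature value are our own illustration, not code from this paper or the SIMCLR release.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Normalised temperature-scaled cross-entropy (NT-Xent) loss.

    z_a, z_b: (N, d) embeddings of two views of the same N inputs;
    row i of z_a and row i of z_b are a positive pair, every other
    row in the batch serves as a negative.
    """
    z = np.concatenate([z_a, z_b], axis=0)            # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalise rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z_a)
    # each row's positive partner sits n positions away: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy: negative log-softmax of the positive entry per row
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimising this loss pulls positive pairs together and pushes batch negatives apart on the unit hypersphere, which is the (dis-)similarity learning signal described above.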

