KNOWDA: ALL-IN-ONE KNOWLEDGE MIXTURE MODEL FOR DATA AUGMENTATION IN LOW-RESOURCE NLP TASKS

Abstract

This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited. Existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) on the limited training instances to produce new synthetic data. Consequently, they possess little task-specific knowledge and tend to yield low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pre-trained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one) so that KnowDA can use this knowledge to quickly grasp the inherent synthesis law of a target task from limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, ours is the first attempt to apply multi-task training over 100+ NLP tasks to data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA successfully improves the performance of strong pre-trained language models (i.e., BERT, ALBERT and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03 and WikiAnn; ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.

1. INTRODUCTION

Neural NLP models require extensive supervised training data to achieve superior performance (Bowman et al., 2015). However, due to the enormous cost of annotating data, developers can often use only limited labeled data when training neural NLP models in real-world applications. This problem has attracted considerable attention recently. Many researchers (Kumar et al., 2020; Wang et al., 2022; Zhou et al., 2021) resort to data augmentation techniques that generate more synthetic samples to boost the performance of low-resource NLP tasks. Existing NLP data augmentation (DA) methods either leverage task-independent heuristic rules, such as Synonym Replacement (Zhang et al., 2015) and Random Swap (Wei & Zou, 2019a), or fine-tune general-purpose pre-trained language models on the handful of training examples of the target task, such as GPT2 (Radford et al., 2019) in LAMBADA (Anaby-Tavor et al., 2020) and T5 (Raffel et al., 2020) in PromDA (Wang et al., 2022), to produce new synthetic data. Consequently, these DA methods possess little target-task knowledge and tend to yield low-quality synthetic data (e.g., either irrelevant or extremely similar to the training data). In addition, these DA methods are often applied to simple NLP tasks whose inputs consist of single, short sentences. Recently, Zhou et al. (2021) demonstrated that these DA methods perform even worse on tasks with complicated structures (e.g., SuperGLUE). These issues prevent the existing DA methods from practical usage in various low-resource NLP tasks. Motivated by this, we propose the Knowledge Mixture Data Augmentation Model (KnowDA) to tackle these two issues. To enrich the task knowledge of KnowDA, we propose Knowledge Mixture Training (KoMT), which represents various heterogeneous NLP tasks in a unified and scalable manner.
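To make the heuristic baselines concrete, the following is a minimal sketch of the two rules mentioned above, Synonym Replacement and Random Swap. The tiny synonym table is purely illustrative; real implementations typically draw synonyms from a lexical resource such as WordNet.

```python
import random

# Toy synonym table for illustration only; actual systems query WordNet.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replacement(tokens, n=1, rng=random):
    """Replace up to n tokens that have a known synonym."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, n=1, rng=random):
    """Swap two randomly chosen token positions, n times."""
    out = list(tokens)
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Both rules are task-independent: they never consult the target task's label space, which is precisely why the paper argues they carry no task-specific knowledge.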
Specifically, in KoMT, arbitrary NLP task instances are represented in a key-value list format, where the key is typically a short phrase indicating the feature's function and the value is a string representation of the feature's content. We further employ denoising training objectives at different granularities of the NLP task instances. That is, during KoMT, we randomly mask a subset of values (e.g., an input document or a question) and train KnowDA to reconstruct the masked ones. With this dynamic multi-granularity masking mechanism and the unified format, we successfully scale Knowledge Mixture Training (KoMT) to about 100 NLP tasks without much human effort. Compared with previous unified multi-task learning works (e.g., T0 (Sanh et al., 2022) and FLAN (Wei et al., 2022)), KoMT is more scalable and comprehensive because those works rely heavily on human-crafted prompts and are trained only to improve NLP task performance (i.e., only generating correct output labels). Furthermore, previous data augmentation methods only focus on simple NLP tasks, such as single-sentence classification. However, modern NLP tasks, such as NLI and QA, have much more complicated structures (i.e., multiple input texts and long documents). To enable KnowDA to handle NLP tasks with such complicated structures, we propose a novel auto-regressive generation framework for KnowDA. At each step, we either fine-tune or directly use (i.e., zero-shot) a separate copy of KnowDA to generate one textual feature of an NLP instance. The feature generation order follows the task-specific feature dependency. As KnowDA is trained to generate arbitrary input text features, we find it beneficial to generate long text in a zero-shot manner directly using the KnowDA checkpoint. In this scenario, we control the outputs of KnowDA using the relevant feature keys and full example demonstrations from the target task.
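The key-value serialization and value-level masking described above can be sketched as follows. The separator, key names, and sentinel token are our illustrative assumptions (the sentinel mimics T5-style span corruption); the paper does not specify the exact surface format.

```python
import random

MASK = "<extra_id_0>"  # assumed sentinel, in the style of T5 denoising

def to_key_value_text(instance):
    """Serialize an NLP task instance (a dict) as a key-value list string."""
    return " | ".join(f"{k}: {v}" for k, v in instance.items())

def mask_values(instance, mask_prob=0.5, rng=random):
    """Randomly mask a subset of values; return (source_text, targets).

    The model is trained to reconstruct the masked values, so masking a
    long value (e.g., a document) yields a coarse-grained objective and
    masking a short one (e.g., a label) a fine-grained objective.
    """
    masked, targets = {}, {}
    for k, v in instance.items():
        if rng.random() < mask_prob:
            masked[k] = MASK
            targets[k] = v
        else:
            masked[k] = v
    return to_key_value_text(masked), targets

# Example serialization (hypothetical QA instance):
# to_key_value_text({"question": "Who wrote Hamlet?", "answer": "Shakespeare"})
# → "question: Who wrote Hamlet? | answer: Shakespeare"
```

Because every task reduces to the same dict-of-strings shape, adding a new task to KoMT requires only naming its features, which is what makes the mixture scalable to 100+ tasks.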
For evaluation, we conduct experiments on the challenging FewGLUE benchmark (Schick & Schütze, 2020) with 32 training examples. We also verify the effectiveness of KnowDA on two sequence labeling tasks, CoNLL'03 (Sang & De Meulder, 2003) and WikiAnn (Pan et al., 2017), whose task types are held out during KoMT. KnowDA successfully outperforms recently proposed state-of-the-art data augmentation algorithms such as FlipDA (Zhou et al., 2021) and PromDA (Wang et al., 2022). We further compare the quality of the synthetic data generated by KnowDA and FlipDA, confirming that KnowDA produces synthetic data with higher diversity and better human-verified quality. The contributions of this paper are the following: (1) to the best of our knowledge, ours is the first work to scale the number of tasks to 100+ in multi-task pre-training for data augmentation; (2) we propose KoMT, a novel multi-task pre-training approach for data augmentation, resulting in a new pre-trained model, KnowDA; and (3) KnowDA outperforms state-of-the-art data augmentation methods in the low-resource setting of the popular benchmarks FewGLUE, CoNLL'03, and WikiAnn.

2. METHOD

In this section, we introduce our setting in Sec. 2.1 and then discuss the details of KnowDA, including the design of KoMT and the auto-regressive data augmentation procedure, in Sec. 2.3 and Sec. 2.4.

2.1. DATA AUGMENTATION FOR LOW-RESOURCE NLP

In low-resource NLP tasks, only a handful of labeled training examples T = {(x_i, y_i)}_{i=1}^{n} are available. Data augmentation generates synthetic data T_Syn = {(x_i, ŷ_i)}_{i=1}^{m} from the original labeled training data T using language models, where m is allowed to be much larger than n. The goal is for NLP models trained on T ∪ T_Syn to outperform those trained only on T.
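This protocol can be written down as a short, generator-agnostic sketch. The function names (`train_fn`, `generate_fn`) are our placeholders: any DA method from Sec. 1, including KnowDA, plays the role of `generate_fn`.

```python
def augment_and_train(train_fn, generate_fn, T, m):
    """Low-resource DA protocol: synthesize m examples from the n labeled
    ones (m may be much larger than n), then train on the union T ∪ T_Syn.

    train_fn:    trains an NLP model on a list of (x, y) pairs
    generate_fn: produces m synthetic (x, ŷ) pairs from T (e.g., KnowDA)
    """
    T_syn = generate_fn(T, m)
    assert len(T_syn) == m
    return train_fn(T + T_syn)
```

The sketch makes the evaluation criterion explicit: the same `train_fn` is run with and without `T_syn`, and a DA method succeeds only if the augmented run scores higher.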

2.2. OVERVIEW OF KNOWDA

KnowDA is an encoder-decoder generative language model that generates task-relevant and diverse synthetic data from scratch. It is initialized from an existing pre-trained encoder-decoder language

