KNOWDA: ALL-IN-ONE KNOWLEDGE MIXTURE MODEL FOR DATA AUGMENTATION IN LOW-RESOURCE NLP TASKS

Abstract

This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited. Existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) on the limited training instances to produce new synthetic data. Consequently, they possess negligible task-specific knowledge and tend to yield low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pre-trained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one) so that KnowDA can utilize this knowledge to quickly grasp the inherent synthesis law of the target task from limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format, and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, ours is the first attempt to apply multi-task training over 100+ NLP tasks to data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA improves the performance of strong pre-trained language models (i.e., BERT, ALBERT and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03 and WikiAnn; ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.
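To make the KoMT recipe above concrete, the sketch below shows what "unified text-to-text format" and "denoising at different granularities" could look like in practice. The serialization layout, sentinel token, and task fields here are hypothetical illustrations (loosely T5-style), not the paper's actual format.

```python
import random

def to_text_to_text(task_name, fields, label):
    # Hypothetical serialization: key-value fields flattened into one source
    # string, with the label appended as the target-side text.
    src = f"Task: {task_name}. " + " ".join(f"{k}: {v}" for k, v in fields.items())
    return src + f" Label: {label}"

def span_mask(tokens, mask_ratio=0.3, rng=random):
    """Coarse-grained denoising objective: replace one contiguous span with a
    sentinel token; the model must reconstruct the masked span. Finer or
    coarser granularities would mask shorter spans or the complete sample."""
    n = max(1, int(len(tokens) * mask_ratio))
    start = rng.randrange(len(tokens) - n + 1)
    corrupted = tokens[:start] + ["<extra_id_0>"] + tokens[start + n:]
    target = ["<extra_id_0>"] + tokens[start:start + n]
    return corrupted, target

# A (hypothetical) NLI example flattened and corrupted for denoising training.
example = to_text_to_text(
    "NLI",
    {"premise": "A man eats.", "hypothesis": "Someone eats."},
    "entailment",
)
corrupted, target = span_mask(example.split())
print(" ".join(corrupted))
print(" ".join(target))
```

Training on many such (corrupted, target) pairs across heterogeneous tasks is what lets a single model absorb task-specific regularities it can later reuse for synthesis.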

1. INTRODUCTION

Neural NLP models require extensive supervised training data to achieve superior performance (Bowman et al., 2015). However, due to the enormous cost of annotating data, developers can often use only limited labeled data when training neural NLP models in common real-world applications. This problem has attracted considerable attention recently. Many researchers (Kumar et al., 2020; Wang et al., 2022; Zhou et al., 2021) resort to data augmentation techniques to generate more synthetic samples to boost the performance of low-resource NLP tasks. Existing NLP data augmentation (DA) methods either leverage task-independent heuristic rules, such as Synonym Replacement (Zhang et al., 2015) and Random Swap (Wei & Zou, 2019a), or fine-tune general-purpose pre-trained language models on the handful of training examples of the target task, such as GPT2 (Radford et al., 2019) in LAMBADA (Anaby-Tavor et al., 2020) and T5 (Raffel et al., 2020) in PromDA (Wang et al., 2022), to produce new synthetic data. Consequently, these DA methods possess negligible target-task knowledge and tend to yield low-quality synthetic data (e.g., either irrelevant or extremely similar to the training data). In addition, these DA methods
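For readers unfamiliar with the heuristic-rule baselines named above, here is a minimal sketch of Synonym Replacement and Random Swap in the spirit of EDA (Wei & Zou, 2019a). The synonym table is a toy stand-in; the original work draws synonyms from WordNet.

```python
import random

# Toy synonym table (hypothetical); a real implementation would use WordNet.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "film": ["movie"],
    "good": ["great", "fine"],
}

def synonym_replacement(tokens, n=1, rng=random):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, n=1, rng=random):
    """Swap the positions of two randomly chosen tokens, n times."""
    out = list(tokens)
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

tokens = "the quick film was good".split()
print(synonym_replacement(tokens, n=1))
print(random_swap(tokens, n=1))
```

Because these rules never consult the task or its label semantics, the augmented sentence can drift from the original meaning, which is precisely the task-knowledge gap the paper targets.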

