PMIXUP: SIMULTANEOUS UTILIZATION OF PART-OF-SPEECH REPLACEMENT AND FEATURE SPACE INTERPOLATION FOR TEXT DATA AUGMENTATION

Abstract

Data augmentation has become a de facto technique in various NLP tasks to overcome the lack of large-scale, high-quality training sets. Previous studies presented several data augmentation methods, such as replacing tokens with synonyms or interpolating the feature space of a given text input. While these methods are convenient and promising, several limitations remain. First, prior studies simply treated topic classification and sentiment analysis under the same category of text classification, although we presume they have distinct characteristics. Second, previously proposed replacement-based methods leave several avenues for improvement, as they rely on heuristics or statistical approaches for choosing synonyms. Lastly, while the feature space interpolation method achieved the current state-of-the-art, prior studies have not comprehensively combined it with replacement-based methods. To mitigate these drawbacks, we first analyze which POS tags are important in each text classification task, and find that nouns are essential to topic classification, while sentiment analysis regards verbs and adjectives as important POS information. Contrary to this analysis, we discover that augmenting verb and adjective tokens commonly improves text classification performance regardless of task type. Lastly, we propose PMixUp, a novel data augmentation strategy that simultaneously utilizes replacement-based and feature space interpolation methods. We show that PMixUp sets a new state-of-the-art in nine public benchmark settings, especially when few training samples are available.

1. INTRODUCTION

Background and Motivation Recent improvements in deep neural networks have empowered remarkable advancements in various Natural Language Processing (NLP) tasks such as text classification (Minaee et al., 2021), question answering (Rogers et al., 2021), and natural language inference (Bowman & Zhu, 2019). However, these strong performances rely on large, high-quality training sets under the supervised regime. When machine learning practitioners have only a limited number of training samples, models tend to suffer from overfitting and fail to reach the expected performance. As acquiring large-scale, high-quality training samples requires considerable resources, numerous studies have sought to improve model performance under a limited number of training samples in NLP tasks (Hedderich et al., 2020). One promising approach is data augmentation, which generates new training samples by modifying original training samples through transformations (Chen et al., 2021). The underlying motivation of data augmentation is to expand the original training set while the transformed text sustains the original sample's overall semantics (Bayer et al., 2021). Several studies on data augmentation techniques proposed back-translation (Sennrich et al., 2015) and word replacement with predictive language models (Anaby-Tavor et al., 2020; Yang et al., 2020b). As these approaches are costly to implement (e.g., back-translation requires a well-trained translation model), academia sought more lightweight augmentation methods. Several studies presented replacement-based augmentation methods in which tokens are replaced with synonyms fetched from large-scale dictionaries (e.g., WordNet), as synonyms are less likely to affect the semantics of the original training samples (Bayer et al., 2021; Feng et al., 2020; Wei & Zou, 2019).
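To make the replacement-based family concrete, the following sketch performs EDA-style synonym replacement. A small hand-written synonym table stands in for a large-scale dictionary such as WordNet; the function name and the table entries are our own illustrative assumptions, not any prior work's implementation.

```python
import random

# Toy synonym table standing in for a large-scale dictionary such as WordNet.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
    "sad": ["unhappy", "sorrowful"],
}

def synonym_replace(tokens, n_replacements=1, rng=None):
    """Return a copy of `tokens` with up to `n_replacements` tokens
    swapped for a randomly chosen synonym (EDA-style replacement)."""
    rng = rng or random.Random(0)
    out = list(tokens)
    # Only tokens present in the dictionary are candidates for replacement.
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out
```

Because synonyms are drawn from a curated dictionary, the augmented sentence keeps roughly the same meaning while presenting the model with a new surface form.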
While replacement-based augmentation methods improve model performance, they risk creating unrealistic or non-conforming samples. This drawback motivated the development of feature space interpolation methods that transform a given input in a feature space rather than directly changing tokens (Sun et al., 2020; Chen et al., 2020b; Shen et al., 2020). This family of methods has become the state-of-the-art in data augmentation for text classification. Main Idea and Its Novelty When we scrutinize replacement-based approaches, we find that most of them do not explicitly consider Part-of-Speech (POS) information. We presume there exist particular POS tags that highly correlate with text classification performance, but they were not actively considered in past studies. Furthermore, we hypothesize that the essential POS information differs across the text classification types (topic classification and sentiment analysis) presented in (Sun et al., 2019). As these tasks have differing characteristics, we expect the augmentation strategy should be tailored to each task, which prior works did not consider. Lastly, replacement-based methods and feature space interpolation approaches have not been utilized simultaneously. Replacement-based methods have the advantage of producing more contextual augmentation results, as they directly transform word tokens into their synonyms. On the other hand, feature space interpolation methods are effective because they can create an unlimited number of new augmented samples: rather than transforming tokens, they add perturbations to the feature vector. This leads to our primary motivational question: what if we simultaneously utilize both replacement-based and feature space interpolation methods?
To this end, we propose a novel data augmentation method denoted as Part-of-speech MixUp (PMixUp), which replaces tokens belonging to particular POS tags and then applies feature space interpolation in sequential order. We presume that the simultaneous utilization of both replacement-based and feature space interpolation methods would further improve classification performance. As a preliminary analysis, we first examine which POS information influences each of the two text classification types the most, and validate whether augmenting tokens that do not belong to the important POS tags helps maintain the original sentence's overall semantics. We then examine PMixUp's effectiveness on nine benchmark datasets under various numbers of training samples per class, and find that the proposed PMixUp achieves a new state-of-the-art in the given public benchmark settings. We highlight our work's novelty in the following aspects. First, our work is the first to analyze how the importance of POS information differs across classification types. Second, to the best of our knowledge, our study is the first attempt to simultaneously utilize replacement-based and feature space interpolation methods for text augmentation.
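To make the sequential order concrete, the sketch below first replaces only tokens carrying a target POS tag and then mixes the encoded results. The toy tagger, synonym table, and hashed bag-of-words encoder are illustrative stand-ins for a real POS tagger, WordNet, and a trained sentence encoder; they are our own assumptions, not the paper's actual components.

```python
import numpy as np

# Illustrative stand-ins: a real pipeline would use a trained POS tagger,
# WordNet synonyms, and a neural sentence encoder.
POS = {"love": "VERB", "hate": "VERB", "good": "ADJ", "bad": "ADJ"}
SYN = {"love": "adore", "hate": "detest", "good": "fine", "bad": "poor"}

def pos_guided_replace(tokens, target_tags=("VERB", "ADJ")):
    """Stage 1: replace only tokens whose POS tag is in `target_tags`."""
    return [SYN.get(t, t) if POS.get(t) in target_tags else t for t in tokens]

def encode(tokens, dim=16):
    """Toy hashed bag-of-words encoder standing in for a trained model."""
    v = np.zeros(dim)
    for t in tokens:
        v[hash(t) % dim] += 1.0
    return v

def pmixup_pair(tokens_a, tokens_b, y_a, y_b, alpha=0.2, rng=None):
    """Stage 2: mixup-style interpolation of the POS-replaced, encoded samples."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    h_a = encode(pos_guided_replace(tokens_a))
    h_b = encode(pos_guided_replace(tokens_b))
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b
```

Replacement diversifies the surface forms before encoding, and interpolation then multiplies the number of reachable feature points, so the two stages compound rather than overlap.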

Key Contributions

• We empirically discovered that nouns serve as critical label determinants in topic classification, while verbs and adjectives play essential roles in sentiment analysis.
• We found that replacing tokens that do not belong to the important POS tags of each text classification task helps maintain the original sentence's meaning and elevates the text classification performance. Nevertheless, we discovered a common trend across the two tasks: replacing tokens under the verb and adjective POS tags improved the test performance the most, which we attribute to the low interference with the core semantics of the original text when such tokens are replaced.
• We present PMixUp, a novel data augmentation technique that combines POS-guided replacement and feature space interpolation methods. On nine public benchmark datasets, PMixUp improves classification performance beyond prior works, establishing a new state-of-the-art in the given settings.
• We discovered that PMixUp is especially effective when only a few training samples per class are available. We further observed that the beneficial impact of every data augmentation method decreases when many training samples exist.
• Lastly, we found that the superiority of PMixUp derives from a larger knowledge capacity: our method makes the model acquire a richer understanding of the given data.

2. RELATED WORKS

Replacement-based Data Augmentation The replacement-based augmentation strategies generate text samples by replacing particular tokens or words with synonyms (Wei & Zou, 2019) . This

