TOWARD ADVERSARIAL TRAINING ON CONTEXTUALIZED LANGUAGE REPRESENTATION

Abstract

Beyond the recent success of adversarial training (AT) in the text domain on top of pre-trained language models (PLMs), our empirical study reveals inconsistent gains from AT on some tasks, e.g. commonsense reasoning and named entity recognition. This paper investigates AT from the perspective of the contextualized language representation produced by PLM encoders. We find that current AT attacks tend to generate sub-optimal adversarial examples that can fool the decoder part but have only a minor effect on the encoder, whereas effectively deviating the latter proves necessary for AT to yield gains. Based on this observation, we propose a simple yet effective method, Contextualized representation-Adversarial Training (CreAT), in which the attack is explicitly optimized to deviate the contextualized representation of the encoder. This enables a global optimization of adversarial examples that can fool the entire model. We also find that CreAT provides a better direction for optimizing adversarial examples, making them less sensitive to hyperparameters. Compared to AT, CreAT produces consistent performance gains on a wider range of tasks and proves more effective for language pretraining, where only the encoder part is kept for downstream tasks. We achieve new state-of-the-art results on a series of challenging benchmarks, e.g. AdvGLUE (59.1 → 61.1), HellaSWAG (93.0 → 94.9), ANLI (68.1 → 69.3).
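The core idea of the attack described above can be illustrated with a minimal NumPy sketch. Note this is an illustrative toy, not the paper's implementation: a single tanh layer stands in for the Transformer encoder, the perturbation lives directly in the input space rather than on token embeddings, and `eps`, `lr`, and the step count are arbitrary choices. The objective, however, is the one the abstract names: gradient ascent on a perturbation that maximizes the deviation of the encoder's representation, projected onto a small norm ball.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": one tanh layer standing in for a Transformer encoder.
W = rng.normal(size=(8, 16))

def encode(x):
    return np.tanh(W @ x)

def creat_attack(x, eps=0.5, lr=0.1, steps=10):
    """Find a perturbation delta (||delta|| <= eps) that maximizes the
    deviation ||encode(x + delta) - encode(x)||^2 by projected
    gradient ascent. Hyperparameters here are illustrative only."""
    clean = encode(x)
    # Small random init: at delta = 0 the deviation (and its gradient)
    # is exactly zero, so ascent could not start from there.
    delta = 1e-3 * rng.normal(size=x.shape)
    for _ in range(steps):
        h = W @ (x + delta)
        dev = np.tanh(h) - clean
        # Analytic gradient of ||dev||^2 w.r.t. delta (chain rule
        # through tanh): 2 * W^T [(1 - tanh^2(h)) * dev].
        grad = W.T @ ((1 - np.tanh(h) ** 2) * 2 * dev)
        delta += lr * grad / (np.linalg.norm(grad) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:           # project back onto the eps-ball
            delta *= eps / norm
    return delta

x = rng.normal(size=16)
delta = creat_attack(x)
deviation = np.linalg.norm(encode(x + delta) - encode(x))
```

In CreAT proper, this maximization targets the encoder's contextualized representation rather than the task loss alone, which is what distinguishes it from standard AT attacks that may leave the encoder largely unaffected.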

1. INTRODUCTION

Adversarial training (AT) (Goodfellow et al., 2015) is designed to improve network robustness: the network is trained to withstand small but malicious perturbations while still making correct predictions. In the text domain, recent studies (Zhu et al., 2020; Jiang et al., 2020) show that AT can be well-deployed on pre-trained language models (PLMs) (e.g. BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019)) and produces impressive performance gains on a number of natural language understanding (NLU) benchmarks (e.g. sentiment analysis, QA). However, there remains a concern as to whether it is really the adversarial examples that facilitate model training. Some studies (Moyer et al., 2018; Aghajanyan et al., 2021) point out that a similar performance gain can be achieved by imposing random perturbations. To answer this question, we present comprehensive empirical results of AT on a wider range of NLP tasks (e.g. reading comprehension, dialogue, commonsense reasoning, NER). It turns out that the gains from AT are inconsistent across tasks: on some tasks, AT appears mediocre or even harmful.

This paper investigates AT from the perspective of the contextualized language representation obtained by the Transformer encoder (Vaswani et al., 2017). The background is that a PLM is typically composed of two parts, a Transformer-based encoder and a decoder; the decoder varies across tasks (sometimes it is simply a linear classifier). Our study shows that the AT attack excels at fooling the


* Corresponding author. This paper was partially supported by Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011). Code: https://github.com/gingasan/CreAT.

