INFOBERT: IMPROVING ROBUSTNESS OF LANGUAGE MODELS FROM AN INFORMATION THEORETIC PERSPECTIVE

Abstract

Large-scale pre-trained language models such as BERT and RoBERTa have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable to textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) an Anchored Feature regularizer, which increases the mutual information between local stable features and global features. We provide a principled way to theoretically analyze and improve the robustness of language models in both standard and adversarial training. Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. Our code is available at https://github.com/AI-secure/InfoBERT.

1. INTRODUCTION

Self-supervised representation learning pre-trains good feature extractors from massive unlabeled data, which show promising transferability to various downstream tasks. Recent successes include large-scale pre-trained language models (e.g., BERT, RoBERTa, and GPT-3 (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020)), which have advanced the state of the art over a wide range of NLP tasks such as NLI and QA, even surpassing human performance. In the computer vision domain, many studies have shown that self-supervised representation learning essentially solves the problem of maximizing the mutual information (MI) I(X; T) between the input X and the representation T (van den Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Chen et al., 2020). Since MI is computationally intractable in high-dimensional feature space, many MI estimators (Belghazi et al., 2018) have been proposed to serve as lower bounds (Barber & Agakov, 2003; van den Oord et al., 2018) or upper bounds (Cheng et al., 2020) of MI. Recently, Kong et al. pointed out that the MI maximization principle of representation learning applies not only to computer vision but also to the NLP domain, and proposed a unified view in which recent pre-trained language models maximize a lower bound of MI among different segments of a word sequence.

On the other hand, deep neural networks are known to be prone to adversarial examples (Goodfellow et al., 2015; Papernot et al., 2016; Eykholt et al., 2017; Moosavi-Dezfooli et al., 2016), i.e., the outputs of neural networks can be arbitrarily wrong when human-imperceptible adversarial perturbations are added to the inputs. Textual adversarial attacks typically perform word-level substitution (Ebrahimi et al., 2018; Alzantot et al., 2018; Ren et al., 2019) or sentence-level paraphrasing (Iyyer et al., 2018; Zhang et al., 2019) to produce perturbations that preserve semantics and utility, appearing innocuous to humans while fooling NLP models. Recent studies (Jin et al., 2020; Zang et al., 2020; Nie et al., 2020; Wang et al., 2020) further show that even large-scale pre-trained language models (LMs) such as BERT are vulnerable to adversarial attacks, which raises the challenge of building robust real-world LM applications against unknown adversarial attacks.

We investigate the robustness of language models from an information-theoretic perspective, and propose a novel learning framework, InfoBERT, which improves the robustness of language representations by refining both local features (word-level representations) and global features (sentence-level representations). InfoBERT considers two MI-based regularizers: (i) the Information Bottleneck regularizer extracts approximately minimal sufficient statistics for downstream tasks, while removing excessive and noisy information that may incur adversarial attacks; (ii) the Anchored Feature regularizer carefully selects useful local stable features that are invulnerable to adversarial attacks, and maximizes the mutual information between these local stable features and the global features to improve the robustness of the global representation (a schematic form of the resulting objective is sketched after the contribution list below). In this paper, we provide a detailed theoretical analysis to explicate the effect of InfoBERT on robustness improvement, along with extensive empirical adversarial evaluation to validate the theory. Our contributions are summarized as follows.
(i) We propose a novel learning framework, InfoBERT, from an information-theoretic perspective, aiming to effectively improve the robustness of language models.
(ii) We provide a principled theoretical analysis of model robustness, and propose two MI-based regularizers to refine the local and global features, which can be applied to both standard and adversarial training for different NLP tasks.
(iii) Comprehensive experimental results demonstrate that InfoBERT substantially improves robust accuracy without sacrificing benign accuracy, yielding state-of-the-art performance across multiple adversarial datasets on NLI and QA tasks.
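To make the two regularizers concrete, the overall fine-tuning objective can be written schematically as follows. This is an illustrative reconstruction from the description above rather than the exact formulation derived later in the paper; the trade-off coefficients α and β, the set A of selected local stable (anchored) features, the word-level features t_i, and the sentence-level feature Z are notation introduced here purely for illustration.

\[
\max_{\theta}\;
\underbrace{I(Y; T)}_{\text{task-relevant information}}
\;-\;\beta\,\underbrace{I(X; T)}_{\text{Information Bottleneck}}
\;+\;\alpha \sum_{i \in \mathcal{A}} \underbrace{I(t_i; Z)}_{\text{Anchored Feature}}
\]

The first two terms implement the information-bottleneck trade-off: retain the information needed to predict the label Y while suppressing input information that may carry adversarial noise. The last term ties the global representation Z to local features selected as stable under perturbation.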

2. RELATED WORK

Textual Adversarial Attacks/Defenses Most existing textual adversarial attacks focus on word-level adversarial manipulation. Ebrahimi et al. (2018) were the first to propose a white-box gradient-based attack to search for adversarial word/character substitutions. Follow-up work (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Jin et al., 2020) further constrains the perturbation search space and adopts part-of-speech checking to make NLP adversarial examples look natural to humans. Existing defenses against textual adversarial attacks can be classified into three categories: (i) Adversarial Training is a practical method to defend against adversarial examples. Existing work either uses PGD-based attacks to generate adversarial examples in the embedding space of NLP models as data augmentation (Zhu et al., 2020a), or regularizes the standard objective using virtual adversarial training (Jiang et al., 2020; Liu et al., 2020; Gan et al., 2020). However, one drawback is that the threat model is often unknown, which renders adversarial training less effective against unseen attacks. (ii) Interval Bound Propagation (IBP) (Dvijotham et al., 2018) is proposed as a technique to account for the worst-case perturbation theoretically. Recent work (Huang et al., 2019; Jia et al., 2019) has applied IBP in the NLP domain to certify the robustness of models. However, IBP-based methods rely on strong assumptions about the model architecture and are difficult to adapt to recent transformer-based language models. (iii) Randomized Smoothing (Cohen et al., 2019) provides a tight robustness guarantee in ℓ2 norm by smoothing the classifier with Gaussian noise. Ye et al. (2020) adapt the idea to the NLP domain, replacing the Gaussian noise with synonym words to certify robustness as long as adversarial word substitutions fall into predefined synonym sets. However, guaranteeing the completeness of the synonym set is challenging.

Representation Learning The MI maximization principle has been adopted by many studies on self-supervised representation learning (van den Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Chen et al., 2020). Specifically, InfoNCE (van den Oord et al., 2018) is used as a lower bound of MI, framing the problem as contrastive learning (Saunshi et al., 2019; Yu et al., 2020). However, Tian et al. (2020) suggest that the InfoMax (Linsker, 1988) principle may introduce excessive and noisy information, which could be adversarial. To generate robust representations, Zhu et al. (2020b) formalize the problem from a mutual-information perspective and essentially perform adversarial training for worst-case perturbations, while mainly considering the continuous input space of computer vision. In contrast, InfoBERT originates from an information-theoretic perspective and is compatible with both standard and adversarial training for the discrete input space of language models.
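For reference, the InfoNCE bound mentioned above takes the following standard form (van den Oord et al., 2018), where g is a learned critic scoring (input, representation) pairs and N is the number of samples used for contrast; this is the generic estimator, not a specific instantiation used by InfoBERT:

\[
I(X; T)\;\ge\;\log N \;-\; \mathcal{L}_{\mathrm{NCE}},
\qquad
\mathcal{L}_{\mathrm{NCE}} \;=\; -\,\mathbb{E}\!\left[\log \frac{\exp\big(g(x_i, t_i)\big)}{\sum_{j=1}^{N} \exp\big(g(x_i, t_j)\big)}\right].
\]

Maximizing this lower bound encourages each positive pair (x_i, t_i) to be distinguishable from the N − 1 negative pairings, which is precisely the contrastive-learning view cited above.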

