INFOBERT: IMPROVING ROBUSTNESS OF LANGUAGE MODELS FROM AN INFORMATION THEORETIC PERSPECTIVE

Abstract

Large-scale pre-trained language models such as BERT and RoBERTa have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable to textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) an Anchored Feature regularizer, which increases the mutual information between local stable features and global features. We provide a principled way to theoretically analyze and improve the robustness of language models in both standard and adversarial training. Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. Our code is available at https://github.com/AI-secure/InfoBERT.

1. INTRODUCTION

Self-supervised representation learning pre-trains good feature extractors from massive unlabeled data, which show promising transferability to various downstream tasks. Recent successes include large-scale pre-trained language models (e.g., BERT, RoBERTa, and GPT-3 (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020)), which have advanced the state of the art over a wide range of NLP tasks such as NLI and QA, even surpassing human performance. Specifically, in the computer vision domain, many studies have shown that self-supervised representation learning essentially solves the problem of maximizing the mutual information (MI) $I(X; T)$ between the input $X$ and the representation $T$ (van den Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Chen et al., 2020). Since MI is computationally intractable in high-dimensional feature space, many MI estimators (Belghazi et al., 2018) have been proposed to serve as lower bounds (Barber & Agakov, 2003; van den Oord et al., 2018) or upper bounds (Cheng et al., 2020) of MI. Recently, Kong et al. point out that the MI maximization principle of representation learning applies not only to computer vision but also to the NLP domain, and propose a unified view that recent pre-trained language models maximize a lower bound of MI among different segments of a word sequence.

On the other hand, deep neural networks are known to be prone to adversarial examples (Goodfellow et al., 2015; Papernot et al., 2016; Eykholt et al., 2017; Moosavi-Dezfooli et al., 2016), i.e., the outputs of neural networks can be arbitrarily wrong when human-imperceptible adversarial perturbations are added to the inputs. Textual adversarial attacks typically perform word-level substitution (Ebrahimi et al., 2018; Alzantot et al., 2018; Ren et al., 2019) or sentence-level paraphrasing (Iyyer et al., 2018; Zhang et al., 2019) to produce semantics/utility-preserving changes that seem innocuous to humans while fooling NLP models.
Recent studies (Jin et al., 2020; Zang et al., 2020; Nie et al., 2020; Wang et al., 2020) further show that even large-scale pre-trained language models (LM) such as BERT are vulnerable to adversarial attacks, which raises the challenge of building robust real-world LM applications against unknown adversarial attacks. We investigate the robustness of language models from an information-theoretic perspective, and propose a novel learning framework, InfoBERT, which improves the robustness of language representations by fine-tuning both local features (word-level representation) and global features (sentence-level representation). InfoBERT considers two MI-based regularizers: (i) the Information Bottleneck regularizer extracts approximate minimal sufficient statistics for downstream tasks, while removing excessive and noisy information that may incur adversarial attacks; (ii) the Anchored Feature regularizer carefully selects useful local stable features that are invulnerable to adversarial attacks, and maximizes the mutual information between local stable features and global features to improve the robustness of the global representation. In this paper, we provide a detailed theoretical analysis to explicate the effect of InfoBERT on robustness improvement, along with extensive empirical adversarial evaluation to validate the theory.

Our contributions are summarized as follows. (i) We propose a novel learning framework, InfoBERT, from the information-theoretic perspective, aiming to effectively improve the robustness of language models. (ii) We provide a principled theoretical analysis of model robustness, and propose two MI-based regularizers to refine the local and global features, which can be applied to both standard and adversarial training for different NLP tasks. (iii) Comprehensive experimental results demonstrate that InfoBERT substantially improves robust accuracy without sacrificing benign accuracy, yielding state-of-the-art performance across multiple adversarial datasets on NLI and QA tasks.

2. RELATED WORK

Textual Adversarial Attacks/Defenses. Most existing textual adversarial attacks focus on word-level adversarial manipulation. Ebrahimi et al. (2018) is the first to propose a white-box gradient-based attack to search for adversarial word/character substitution. Follow-up work (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Jin et al., 2020) further constrains the perturbation search space and adopts Part-of-Speech checking to make NLP adversarial examples look natural to humans. Existing defenses against textual adversarial attacks can be classified into three categories: (i) Adversarial Training is a practical method to defend against adversarial examples. Existing work either uses PGD-based attacks to generate adversarial examples in the embedding space of NLP as data augmentation (Zhu et al., 2020a), or regularizes the standard objective using virtual adversarial training (Jiang et al., 2020; Liu et al., 2020; Gan et al., 2020). However, one drawback is that the threat model is often unknown, which renders adversarial training less effective when facing unseen attacks. (ii) Interval Bound Propagation (IBP) (Dvijotham et al., 2018) is proposed as a new technique to consider the worst-case perturbation theoretically. Recent work (Huang et al., 2019; Jia et al., 2019) has applied IBP in the NLP domain to certify the robustness of models. However, IBP-based methods rely on strong assumptions about model architecture and are difficult to adapt to recent transformer-based language models. (iii) Randomized Smoothing (Cohen et al., 2019) provides a tight robustness guarantee in $\ell_2$ norm by smoothing the classifier with Gaussian noise. Ye et al. (2020) adapts the idea to the NLP domain, and replaces the Gaussian noise with synonym words to certify robustness as long as the adversarial word substitution falls into predefined synonym sets. However, guaranteeing the completeness of the synonym set is challenging.
Representation Learning. The MI maximization principle has been adopted by many studies on self-supervised representation learning (van den Oord et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019; Chen et al., 2020). Specifically, InfoNCE (van den Oord et al., 2018) is used as the lower bound of MI, forming the problem as contrastive learning (Saunshi et al., 2019; Yu et al., 2020). However, Tian et al. (2020) suggests that the InfoMax (Linsker, 1988) principle may introduce excessive and noisy information, which could be adversarial. To generate robust representations, Zhu et al. (2020b) formalizes the problem from a mutual-information perspective, which essentially performs adversarial training for worst-case perturbation, while mainly considering the continuous space in computer vision. In contrast, InfoBERT originates from an information-theoretic perspective and is compatible with both standard and adversarial training for the discrete input space of language models.

3. INFOBERT

Before diving into details, we first discuss the textual adversarial examples we consider in this paper. We mainly focus on the dominant word-level attack as the main threat model, since it achieves higher attack success and is less noticeable to human readers than other attacks. Due to the discrete nature of the text input space, it is difficult to measure adversarial distortion on the token level. Instead, because most word-level adversarial attacks (Li et al., 2019; Jin et al., 2020) constrain word perturbations via a bounded magnitude in the semantic embedding space, by adapting from Jacobsen et al. (2019), we define the adversarial text examples with distortions constrained in the embedding space.

Definition 3.1. ($\epsilon$-bounded Textual Adversarial Examples). Given a sentence $x = [x_1; x_2; ...; x_n]$, where $x_i$ is the word at the $i$-th position, the $\epsilon$-bounded adversarial sentence $x' = [x'_1; x'_2; ...; x'_n]$ for a classifier $F$ satisfies: (1) $F(x) = o(x) = o(x')$ but $F(x') \neq o(x')$, where $o(\cdot)$ is the oracle (e.g., human decision-maker); (2) $\|t_i - t'_i\|_2 \leq \epsilon$ for $i = 1, 2, ..., n$, where $\epsilon \geq 0$ and $t_i$ is the word embedding of $x_i$.
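As a concrete reading of condition (2) in Definition 3.1, the following minimal sketch checks whether every word embedding of a perturbed sentence stays within an $\ell_2$ ball of radius $\epsilon$ around its benign counterpart. The function name `is_eps_bounded` and the array layout (one row per word) are illustrative assumptions, not part of the paper; condition (1) requires an oracle and is not checked here.

```python
import numpy as np

def is_eps_bounded(t, t_adv, eps):
    """Check condition (2) of Definition 3.1 (hypothetical helper).

    t, t_adv: arrays of shape (n, d) holding the benign and perturbed
    word embeddings, one row per word position.
    Returns True if ||t_i - t'_i||_2 <= eps for every position i.
    """
    t = np.asarray(t, dtype=float)
    t_adv = np.asarray(t_adv, dtype=float)
    if t.shape != t_adv.shape:
        raise ValueError("benign and adversarial embeddings must share a shape")
    # Per-word L2 distance between benign and perturbed embeddings.
    per_word_shift = np.linalg.norm(t_adv - t, axis=1)
    return bool(np.all(per_word_shift <= eps))
```

The bound is applied per word rather than to the whole sentence, which matches the definition's "for $i = 1, 2, ..., n$" quantifier.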

3.1. INFORMATION BOTTLENECK AS A REGULARIZER

In this section, we first discuss the general IB implementation, and then explain how the IB formulation is adapted to InfoBERT as a regularizer, along with a theoretical analysis of why the IB regularizer helps improve the robustness of language models. The IB principle formulates the goal of deep learning as an information-theoretic trade-off between representation compression and predictive power (Tishby & Zaslavsky, 2015). Given the input source $X$, a deep neural net learns the internal representation $T$ of some intermediate layer and maximizes the MI between $T$ and label $Y$, so that $T$, subject to a constraint on its complexity, contains sufficient information to infer the target label $Y$. Finding an optimal representation $T$ can be formulated as the maximization of the Lagrangian

$$\mathcal{L}_{\text{IB}} = I(Y; T) - \beta I(X; T), \tag{1}$$

where $\beta > 0$ is a hyper-parameter to control the trade-off, and $I(Y; T)$ is defined as:

$$I(Y; T) = \int p(y, t) \log \frac{p(y, t)}{p(y)\,p(t)} \, dy \, dt. \tag{2}$$

Since Eq. (2) is intractable, we instead use the lower bound from Barber & Agakov (2003):

$$I(Y; T) \geq \int p(y, t) \log q_\psi(y \mid t) \, dy \, dt, \tag{3}$$

where $q_\psi(y \mid t)$ is the variational approximation learned by a neural network parameterized by $\psi$ for the true distribution $p(y \mid t)$. This indicates that maximizing the lower bound of the first term of IB, $I(Y; T)$, is equivalent to minimizing the task cross-entropy loss $\ell_{\text{task}} = H(Y \mid T)$. To derive a tractable lower bound of IB, we here use an upper bound (Cheng et al., 2020) of $I(X; T)$:

$$I(X; T) \leq \int p(x, t) \log p(t \mid x) \, dx \, dt - \int p(x)\,p(t) \log p(t \mid x) \, dx \, dt. \tag{4}$$

By combining Eq. (3) and (4), we can maximize the tractable lower bound $\hat{\mathcal{L}}_{\text{IB}}$ of IB in practice by:

$$\hat{\mathcal{L}}_{\text{IB}} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \log q_\psi(y^{(i)} \mid t^{(i)}) \Big] - \frac{\beta}{N} \sum_{i=1}^{N} \Big[ \log p(t^{(i)} \mid x^{(i)}) - \frac{1}{N} \sum_{j=1}^{N} \log p(t^{(j)} \mid x^{(i)}) \Big] \tag{5}$$

with data samples $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where $q_\psi$ can represent any classification model (e.g., BERT), and $p(t \mid x)$ can be viewed as the feature extractor $f_\theta: \mathcal{X} \to \mathcal{T}$, where $\mathcal{X}$ and $\mathcal{T}$ are the supports of the input source $X$ and the extracted feature $T$, respectively.

The above is a general implementation of the IB objective function. In InfoBERT, we consider $T$ as the features consisting of the local word-level features after the BERT embedding layer $f_\theta$. The following BERT self-attentive layers along with the linear classification head serve as $q_\psi(y \mid t)$, which predicts the target $Y$ given representation $T$. Formally, given random variables $X = [X_1; X_2; ...; X_n]$ representing input sentences with $X_i$ (word token at the $i$-th index), let $T = [T_1; ...; T_n] = f_\theta([X_1; X_2; ...; X_n]) = [f_\theta(X_1); f_\theta(X_2); ...; f_\theta(X_n)]$ denote the random variables representing the features generated from input $X$ via the BERT embedding layer $f_\theta$, where $T_i \in \mathbb{R}^d$ is the high-dimensional word-level local feature for word $X_i$. Due to the high dimensionality $d$ of each word feature (e.g., 1024 for BERT-large), when the sentence length $n$ increases, the dimensionality of features $T$ becomes too large to compute $I(X; T)$ in practice. Thus, we propose to maximize a localized formulation of IB, $\mathcal{L}_{\text{LIB}}$, defined as:

$$\mathcal{L}_{\text{LIB}} := I(Y; T) - n\beta \sum_{i=1}^{n} I(X_i; T_i). \tag{6}$$

Theorem 3.1. (Lower Bound of $\mathcal{L}_{\text{IB}}$) Given a sequence of random variables $X = [X_1; X_2; ...; X_n]$ and a deterministic feature extractor $f_\theta$, let $T = [T_1; ...; T_n] = [f_\theta(X_1); f_\theta(X_2); ...; f_\theta(X_n)]$. Then the localized formulation of IB, $\mathcal{L}_{\text{LIB}}$, is a lower bound of $\mathcal{L}_{\text{IB}}$ (Eq. (1)), i.e.,

$$I(Y; T) - \beta I(X; T) \geq I(Y; T) - n\beta \sum_{i=1}^{n} I(X_i; T_i).$$
Theorem 3.1 indicates that we can maximize the localized formulation $\mathcal{L}_{\text{LIB}}$ as a lower bound of the IB objective $\mathcal{L}_{\text{IB}}$ when $I(X; T)$ is difficult to compute. In Eq. (6), if we regard the first term ($I(Y; T)$) as a task-related objective, the second term ($-n\beta \sum_{i=1}^{n} I(X_i; T_i)$) can be considered as a regularization term that constrains the complexity of representation $T$, and is thus named the Information Bottleneck regularizer. Next, we give a theoretical analysis of the adversarial robustness of IB and demonstrate why the localized IB objective function can help improve the robustness to adversarial attacks.

Following Definition 3.1, let $T = [T_1; T_2; ...; T_n]$ and $T' = [T'_1; T'_2; ...; T'_n]$ denote the features for the benign sentence $X$ and adversarial sentence $X'$. The distributions of $X$ and $X'$ are denoted by probability $p(x)$ and $q(x')$ with the support $\mathcal{X}$ and $\mathcal{X}'$, respectively. We assume that the feature representation $T$ has finite support, denoted by $\mathcal{T}$, considering the finite vocabulary size in NLP.

Theorem 3.2. (Adversarial Robustness Bound) For random variables $X = [X_1; X_2; ...; X_n]$ and $X' = [X'_1; X'_2; ...; X'_n]$, let $T = [T_1; T_2; ...; T_n] = [f_\theta(X_1); f_\theta(X_2); ...; f_\theta(X_n)]$ and $T' = [T'_1; T'_2; ...; T'_n] = [f_\theta(X'_1); f_\theta(X'_2); ...; f_\theta(X'_n)]$ with finite support $\mathcal{T}$. Then

$$|I(Y; T) - I(Y; T')| \leq B_0 + B_1 \sum_{i=1}^{n} \sqrt{|\mathcal{T}|}\, (I(X_i; T_i))^{1/2} + B_2 \sum_{i=1}^{n} |\mathcal{T}|^{3/4} (I(X_i; T_i))^{1/4} + B_3 \sum_{i=1}^{n} \sqrt{|\mathcal{T}|}\, (I(X'_i; T'_i))^{1/2} + B_4 \sum_{i=1}^{n} |\mathcal{T}|^{3/4} (I(X'_i; T'_i))^{1/4},$$

where $B_0$, $B_1$, $B_2$, $B_3$ and $B_4$ are constants depending on the sequence length $n$, $\epsilon$ and $p(x)$.

The sketch of the proof is to express the difference $|I(Y; T) - I(Y; T')|$ in terms of $I(X_i; T_i)$. Specifically, Eq. (25) factorizes the difference into two summands. The first summand, the conditional entropy difference $|H(T \mid Y) - H(T' \mid Y)|$, can be bounded by Eq. (42) in terms of the MI between benign/adversarial input and representation, $I(X_i; T_i)$ and $I(X'_i; T'_i)$. The second summand $|H(T) - H(T')|$ has a constant upper bound (Eq. (85)), since language models have bounded vocabulary size and embedding space, and thus have bounded entropy.

The intuition of Theorem 3.2 is to bound the adversarial performance drop $|I(Y; T) - I(Y; T')|$ by $I(X_i; T_i)$. As explained in Eq. (3), $I(Y; T)$ and $I(Y; T')$ can be regarded as the model performance on benign and adversarial data. Thus, the LHS of the bound represents such a performance gap. The adversarial robustness bound of Theorem 3.2 indicates that the performance gap narrows when $I(X_i; T_i)$ and $I(X'_i; T'_i)$ decrease. Note that our IB regularizer in the objective function Eq. (6) achieves the same goal of minimizing $I(X_i; T_i)$ while learning the most efficient information features, or approximate minimal sufficient statistics, for downstream tasks. Theorem 3.2 also suggests that combining adversarial training with our IB regularizer can further minimize $I(X'_i; T'_i)$, leading to better robustness, which is verified in §4.
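To make the sampled objective in Eq. (5) more tangible, here is a minimal NumPy sketch that evaluates it on a batch. It assumes $p(t \mid x)$ is an isotropic Gaussian with fixed standard deviation centred on the feature of $x$, which is only one of the modelling options discussed in Appendix A.1; the function names (`gaussian_log_density`, `club_upper_bound`, `ib_objective`) are hypothetical and this is not the authors' released implementation.

```python
import numpy as np

def gaussian_log_density(t_samples, means, sigma):
    """log p(t^(j) | x^(i)) up to an additive constant, for an assumed
    isotropic Gaussian N(mean_i, sigma^2 I). Entry [i, j] pairs input i
    with sample j. Both inputs have shape (N, d)."""
    t_samples = np.asarray(t_samples, dtype=float)
    means = np.asarray(means, dtype=float)
    diff = t_samples[None, :, :] - means[:, None, :]  # [i, j, :] = t_j - mean_i
    return -np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2)

def club_upper_bound(log_p):
    """Sampled version of the upper bound in Eq. (4):
    (1/N) sum_i [ log_p[i, i] - (1/N) sum_j log_p[i, j] ].
    Additive constants in log_p cancel between the two terms."""
    log_p = np.asarray(log_p, dtype=float)
    n = log_p.shape[0]
    if log_p.shape != (n, n):
        raise ValueError("log_p must be a square N x N matrix")
    matched = np.diag(log_p)         # log p(t^(i) | x^(i))
    marginal = log_p.mean(axis=1)    # (1/N) sum_j log p(t^(j) | x^(i))
    return float(np.mean(matched - marginal))

def ib_objective(log_q_label, log_p, beta):
    """Eq. (5): mean label log-likelihood minus beta times the MI bound."""
    return float(np.mean(log_q_label)) - beta * club_upper_bound(log_p)
```

In this sketch the task term is passed in as per-example log-likelihoods, so any classifier can supply it; maximizing `ib_objective` then trades label fit against the compression penalty weighted by `beta`.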

3.2. ANCHORED FEATURE REGULARIZER

In addition to the IB regularizer that suppresses noisy information that may incur adversarial attacks, we propose a novel regularizer termed "Anchored Feature Regularizer", which extracts local stable features and aligns them with sentence global representations, thus improving the stability and robustness of language representations.

Algorithm 1 - Local Anchored Feature Extraction. This algorithm takes in the word local features and returns the index of local anchored features.
1: Input: Word local features $t$, upper and lower thresholds $c_h$ and $c_l$
2: $\delta \leftarrow 0$ // Initialize the perturbation vector $\delta$
3: $g(\delta) = \nabla_\delta \ell_{\text{task}}(q_\psi(t + \delta), y)$ // Perform adversarial attack on the embedding space
4: Sort the magnitudes of the gradient of the perturbation vector, $\|g(\delta)_1\|_2, \|g(\delta)_2\|_2, ..., \|g(\delta)_n\|_2$, into $\|g(\delta)_{k_1}\|_2, \|g(\delta)_{k_2}\|_2, ..., \|g(\delta)_{k_n}\|_2$ in ascending order, where $k_i$ corresponds to its original index.
5: Return: $k_i, k_{i+1}, ..., k_j$, where $c_l \leq \frac{i}{n} \leq \frac{j}{n} \leq c_h$.

The goal of the local anchored feature extraction is to find features that carry useful and stable information for downstream tasks. Instead of directly searching for local anchored features, we start with searching for nonrobust and unuseful features. To identify local nonrobust features, we perform adversarial attacks to detect which words are prone to changes under adversarial word substitution. We consider these vulnerable words as features nonrobust to adversarial threats. Therefore, global robust sentence representations should rely less on these vulnerable statistical clues. On the other hand, by examining the adversarial perturbation on each local word feature, we can also identify words that are less useful for downstream tasks. For example, stopwords and punctuation usually carry limited information, and tend to have smaller adversarial perturbations than words containing more effective information.
Although these unuseful features are barely changed under adversarial attacks, they contain insufficient information and should be discarded. After identifying the nonrobust and unuseful features, we treat the remaining local features in the sentences as useful stable features and align the global feature representation based on them.

During the local anchored feature extraction, we perform "virtual" adversarial attacks that generate adversarial perturbation in the embedding space, as this abstracts the general idea of existing word-level adversarial attacks. Formally, given an input sentence $x = [x_1; x_2; ...; x_n]$ with its corresponding local embedding representation $t = [t_1; ...; t_n]$, where $x$ and $t$ are the realizations of random variables $X$ and $T$, we generate adversarial perturbation $\delta$ in the embedding space so that the task loss $\ell_{\text{task}}$ increases. The adversarial perturbation $\delta$ is initialized to zero, and the gradient of the loss with respect to $\delta$ is calculated by $g(\delta) = \nabla_\delta \ell_{\text{task}}(q_\psi(t + \delta), y)$ to update $\delta \leftarrow \Pi_{\|\delta\|_F \leq \epsilon}\big(\eta\, g(\delta) / \|g(\delta)\|_F\big)$. The above process is similar to one-step PGD with zero-initialized perturbation $\delta$. Since we only care about the ranking of perturbations to decide on robust features, in practice we skip the update of $\delta$ to save computational cost, and simply examine the $\ell_2$ norm of the gradient $g(\delta)_i$ of the perturbation on each word feature $t_i$. A feasible plan is to choose the words whose perturbation is neither too large (nonrobust features) nor too small (unuseful features), e.g., the words whose perturbation rankings are among 50% ∼ 80% of all the words. The detailed procedures are provided in Algorithm 1.

After local anchored features are extracted, we propose to align sentence global representations $Z$ with our local anchored features $T_i$. In practice, we can use the final-layer [CLS] embedding to represent the global sentence-level feature $Z$.
Specifically, we use the information-theoretic tool to increase the mutual information $I(T_i; Z)$ between local anchored features $T_i$ and sentence global representations $Z$, so that the global representations can share more robust and useful information with the local anchored features and focus less on the nonrobust and unuseful ones. By incorporating the term $I(T_i; Z)$ into the previous objective function Eq. (6), our final objective function becomes:

$$\max \; I(Y; T) - n\beta \sum_{i=1}^{n} I(X_i; T_i) + \alpha \sum_{j=1}^{M} I(T_{k_j}; Z),$$

where $T_{k_j}$ are the local anchored features selected by Algorithm 1 and $M$ is the number of local anchored features. An illustrative figure can be found in Appendix Figure 2. In addition, due to the intractability of computing MI, we use InfoNCE (van den Oord et al., 2018) as the lower bound of MI to approximate the last term $I(T_{k_j}; Z)$:

$$\hat{I}^{(\text{InfoNCE})}(T_i; Z) := \mathbb{E}_{\mathbb{P}} \Big[ g_\omega(t_i, z) - \mathbb{E}_{\tilde{\mathbb{P}}} \Big[ \log \sum_{t'_i} e^{g_\omega(t'_i, z)} \Big] \Big],$$

where $g_\omega(\cdot, \cdot)$ is a score function (or critic function) approximated by a neural network, $t_i$ are the positive samples drawn from the joint distribution $\mathbb{P}$ of local anchored features and global representations, and $t'_i$ are the negative samples drawn from the distribution $\tilde{\mathbb{P}}$ of nonrobust and unuseful features.
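The two computational steps of this section, the rank-based selection in Algorithm 1 and the InfoNCE-style estimate of $I(T_{k_j}; Z)$, can be sketched as follows. This is an illustrative NumPy sketch under several assumptions: the per-word gradient is supplied by the caller (in practice it would come from backpropagating the task loss), the critic $g_\omega$ is replaced by a plain dot product rather than a learned network, and the function names are hypothetical.

```python
import numpy as np

def select_anchored_indices(grad, c_l, c_h):
    """Rank-based selection step of Algorithm 1 (sketch).

    grad: array of shape (n, d), the gradient of the task loss with
    respect to the perturbation on each of the n word features.
    Returns the original word indices k_i whose 1-based rank i in the
    ascending ordering of gradient norms satisfies c_l <= i/n <= c_h.
    """
    norms = np.linalg.norm(np.asarray(grad, dtype=float), axis=1)
    order = np.argsort(norms, kind="stable")      # k_1, ..., k_n (ascending)
    n = len(order)
    rank_fraction = np.arange(1, n + 1) / n       # i / n for rank i
    keep = (rank_fraction >= c_l) & (rank_fraction <= c_h)
    return order[keep].tolist()

def infonce_estimate(anchors, negatives, z):
    """InfoNCE-style estimate following the formula in the text.

    anchors:   (m, d) positive samples (local anchored features)
    negatives: (k, d) negative samples (nonrobust / unuseful features)
    z:         (d,)   global sentence representation
    The critic is assumed to be a dot product, g(t, z) = t . z.
    """
    pos_scores = np.asarray(anchors, dtype=float) @ z
    neg_scores = np.asarray(negatives, dtype=float) @ z
    # Numerically stable log-sum-exp over the negative scores.
    shift = neg_scores.max()
    log_sum_exp = shift + np.log(np.sum(np.exp(neg_scores - shift)))
    return float(np.mean(pos_scores - log_sum_exp))
```

With `c_l = 0.5` and `c_h = 0.8`, the selection keeps the words ranked in the 50% to 80% band mentioned in the text, discarding both the smallest-gradient (unuseful) and largest-gradient (nonrobust) words.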

4. EXPERIMENTS

In this section, we demonstrate how effectively InfoBERT improves the robustness of language models over multiple NLP tasks such as NLI and QA. We evaluate InfoBERT against both strong adversarial datasets and state-of-the-art adversarial attacks.

Baselines. Since IBP-based methods (Huang et al., 2019; Jia et al., 2019) cannot be applied to large-scale language models yet, and the randomized-smoothing-based method (Ye et al., 2020) achieves limited certified robustness, we compare InfoBERT against three competitive baselines based on adversarial training: (I) FreeLB (Zhu et al., 2020a) applies adversarial training to language models during the fine-tuning stage to improve generalization. In §4.2, we observe that FreeLB can boost the robustness of language models by a large margin. (II) SMART (Jiang et al., 2020) uses adversarial training as smoothness-inducing regularization and Bregman proximal point optimization during fine-tuning, to improve the generalization and robustness of language models. (III) ALUM (Liu et al., 2020) performs adversarial training in both pre-training and fine-tuning stages, which achieves substantial performance gains on a wide range of NLP tasks. Due to the high computational cost of adversarial training, we compare InfoBERT to ALUM and SMART using the best results reported in the original papers.

4.1. EXPERIMENTAL SETUP

Evaluation Metrics. We use robust accuracy or robust F1 score to measure how robust the baseline models and InfoBERT are when facing adversarial data. Specifically, robust accuracy is calculated by:

$$\text{Acc} = \frac{1}{|D_{\text{adv}}|} \sum_{x' \in D_{\text{adv}}} \mathbb{1}[\arg\max q_\psi(f_\theta(x')) \equiv y],$$

where $D_{\text{adv}}$ is the adversarial dataset, $y$ is the ground-truth label, $\arg\max$ selects the class with the highest logits and $\mathbb{1}[\cdot]$ is the indicator function. Similarly, robust F1 score is calculated by:

$$\text{F1} = \frac{1}{|D_{\text{adv}}|} \sum_{x' \in D_{\text{adv}}} v(\arg\max q_\psi(f_\theta(x')), a),$$

where $v(\cdot, \cdot)$ is the F1 score between the true answer $a$ and the predicted answer $\arg\max q_\psi(f_\theta(x'))$, and $\arg\max$ selects the answer with the highest probability (see Rajpurkar et al. (2016) for details).
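The two metrics above can be sketched in a few lines of Python. The sketch assumes predictions have already been decoded (class ids for NLI, answer strings for QA) and uses a simplified token-level F1 based on whitespace tokenization; the answer normalization used in standard QA evaluation is omitted, and all function names are illustrative.

```python
from collections import Counter

def robust_accuracy(predictions, labels):
    """Fraction of adversarial examples whose predicted class equals the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def token_f1(prediction, answer):
    """Token-overlap F1 between a predicted and a true answer string
    (simplified: whitespace tokens, no text normalization)."""
    pred_tokens = prediction.split()
    gold_tokens = answer.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def robust_f1(predictions, answers):
    """Mean token-level F1 over the adversarial dataset."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(token_f1(p, a) for p, a in zip(predictions, answers)) / len(answers)
```

Both functions average over the adversarial set only, mirroring the $1/|D_{\text{adv}}|$ normalization in the formulas.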

Implementation Details. To demonstrate that InfoBERT is effective for different language models, we apply InfoBERT to both pretrained RoBERTa$_{\text{Large}}$ and BERT$_{\text{Large}}$. Since InfoBERT can be applied to both standard training and adversarial training, we here use FreeLB as the adversarial training implementation. InfoBERT is fine-tuned for 2 epochs for the QA task, and 3 epochs for the NLI task. More implementation details, such as the selection of $\alpha$, $\beta$, $c_h$ and $c_l$, can be found in Appendix A.1.

4.3. ANALYSIS OF LOCAL ANCHORED FEATURES

We conduct an ablation study to further validate that our anchored feature regularizer indeed filters out nonrobust/unuseful information. As shown in Table 1 and 2 We also observe that those local anchored features extracted by our anchored feature regularizer, as expected, contribute more to the MI improvement. As shown in Figure 1 , the MI improvement of anchored features on adversarial test data ∆I R (red bar on the left) is higher than that of nonrobust/unuseful ∆I N (red bar on the right), thus confirming that local anchored features discovered by our anchored feature regularizer have a stronger impact on robustness than nonrobust/unuseful ones. We conduct more ablation studies in Appendix §A.2, including analyzing the individual impact of two regularizers, the difference between global and local features for IB regularizer, hyper-parameter selection strategy and so on.

5. CONCLUSION

In this paper, we propose a novel learning framework, InfoBERT, from an information-theoretic perspective to perform robust fine-tuning over pre-trained language models. Specifically, InfoBERT consists of two novel regularizers to improve the robustness of the learned representations: (a) the Information Bottleneck Regularizer, which learns to extract the approximated minimal sufficient statistics and denoise the excessive spurious features, and (b) the Local Anchored Feature Regularizer, which improves the robustness of global features by aligning them with local anchored features. Supported by our theoretical analysis, InfoBERT provides a principled way to improve the robustness of BERT and RoBERTa against strong adversarial attacks over a variety of NLP tasks, including NLI and QA tasks. Comprehensive experiments demonstrate that InfoBERT outperforms existing baseline methods and achieves a new state of the art on different adversarial datasets. We believe this work will shed light on future research directions towards improving the robustness of representation learning for language models.

[Figure caption, partially recovered] For (b), we both theoretically and empirically demonstrate that we can improve the adversarial robustness by decreasing the mutual information $I(X_i; T_i)$ without affecting the benign accuracy much. For (c), we propose to align the local anchored features $T_{k_j}$ (highlighted in yellow) with the global feature $Z$ by maximizing their mutual information $I(T_{k_j}; Z)$.

A.1 IMPLEMENTATION DETAILS

Model Details. BERT is a transformer-based (Vaswani et al., 2017) model, which is pretrained without supervision on large corpora. We use BERT$_{\text{Large}}$-uncased as the baseline model, which has 24 layers, 1024 hidden units, 16 self-attention heads, and 340M parameters. RoBERTa$_{\text{Large}}$ shares the same architecture as BERT, but modifies key hyperparameters, removes the next-sentence pretraining objective and trains with much larger mini-batches and learning rates, which results in higher performance than the BERT model on GLUE, RACE and SQuAD.

Standard Training Details. For both standard and adversarial training, we fine-tune InfoBERT for 2 epochs on the QA task, and for 3 epochs on the NLI task. The best model is selected based on the performance on the development set. All fine-tuning experiments are run on Nvidia V100 GPUs. For the NLI task, we set the batch size to 256, learning rate to $2 \times 10^{-5}$, max sequence length to 128 and warm-up steps to 1000. For the QA task, we set the batch size to 32, learning rate to $3 \times 10^{-5}$ and max sequence length to 384 without warm-up steps.

Adversarial Training Details. Adversarial training introduces hyper-parameters including adversarial learning rates, number of PGD steps, and adversarial norm. When combining adversarial training with InfoBERT, we use FreeLB as the adversarial training implementation, and set the adversarial learning rate to $10^{-1}$ or $4 \times 10^{-2}$, adversarial steps to 3, maximal perturbation norm to $3 \times 10^{-1}$ or $2 \times 10^{-1}$ and initial random perturbation norm to $10^{-1}$ or 0.

Published as a conference paper at ICLR 2021

Information Bottleneck Regularizer Details. For the information bottleneck, there are different ways to model $p(t \mid x)$:

1. Assume that $p(t \mid x)$ is unknown. We use a neural net $q_\theta(t \mid x)$, parameterized by $\theta$, to learn the conditional distribution $p(t \mid x)$. We assume the distribution is a Gaussian distribution. The neural net $q_\theta$ will learn the mean and variance of the Gaussian given input $x$ and representation $t$. With the reparameterization trick, gradients can be backpropagated through the sampling step, so the neural net can be trained to approximate the distribution given the training samples.

2. $p(t \mid x)$ is known. Since $t$ is the representation encoded by BERT, we actually already know the distribution $p$. We also denote it as $q_\theta$, where $\theta$ is the parameter of the BERT encoder $f_\theta$. If we assume the conditional distribution is a Gaussian $\mathcal{N}(t_i, \sigma)$ for input $x_i$, whose mean is the BERT representation $t_i$ and whose variance is a fixed constant $\sigma$, Eq. (6) becomes

$$\mathcal{L}_{\text{LIB}} = \frac{1}{N} \sum_{i=1}^{N} \bigg( \log q_\psi(y^{(i)} \mid t^{(i)}) - \beta \sum_{k=1}^{n} \Big[ -c(\sigma) \|t'^{(i)}_k - t^{(i)}_k\|_2^2 + \frac{1}{n} \sum_{j=1}^{N} c(\sigma) \|t'_j - t_k\|_2^2 \Big] \bigg),$$

where $c(\sigma)$ is a positive constant related to $\sigma$. In practice, the sample $t'_i$ from the conditional Gaussian distribution $\mathcal{N}(t_i, \sigma)$ can be $t_i$ with some Gaussian noise, an adversarial example of $t_i$, or $t_i$ itself (assuming $\sigma = 0$).

We finally use the second way to model $p(t \mid x)$ for InfoBERT, as it empirically gives higher robustness improvement than the first way (shown in the following §A.2). We suspect that the main reason is that the first way needs to approximate the distribution $p(t \mid x)$ via another neural net, which could present some difficulty in model training.

The Information Bottleneck Regularizer also introduces another parameter $\beta$ to tune the trade-off between representation compression $I(X_i; T_i)$ and predictive power $I(Y; T)$. We search for the optimal $\beta$ via grid search, and set $\beta = 5 \times 10^{-2}$ for RoBERTa and $\beta = 10^{-3}$ for BERT on the NLI task. On the QA task, we set $\beta = 5 \times 10^{-5}$, which is substantially lower than $\beta$ on NLI tasks, thus retaining more word-level features. We think this is mainly because the QA task relies more on the word-level representation to predict the exact answer spans. More ablation results can be found in the following §A.2.

Anchored Feature Regularizer Details. The Anchored Feature Regularizer uses $\alpha$ to weigh the balance between predictive power and the importance of anchored features. We set $\alpha = 5 \times 10^{-3}$ for both NLI and QA tasks.
The Anchored Feature Regularizer also introduces upper and lower thresholds $c_h$ and $c_l$ for anchored feature extraction. We set $c_h = 0.9$ and $c_l = 0.5$ for the NLI task, and set $c_h = 0.95$ and $c_l = 0.75$ for the QA task.

The experimental results are summarized in Table 6. We can see that while both features can boost the model robustness, using local features yields higher robust accuracy improvement than global features, especially when the adversarial training dataset is added.

Hyper-parameter Search. We perform grid search to find the optimal $\beta$ so that the optimal trade-off between representation compression ("minimality") and predictive power ("sufficiency") is achieved. An example of searching for the optimal $\beta$ on the QA dataset is shown in Figure 3, which illustrates how $\beta$ affects the F1 score on benign and adversarial datasets. We can see that, starting from a very small $\beta$, both the robust and benign F1 scores increase, demonstrating that InfoBERT can improve both robustness and generalization to some extent. When we set $\beta = 5 \times 10^{-5}$ ($\log(\beta) = -9.9$), InfoBERT achieves the best benign and adversarial accuracy. When we set a larger $\beta$ to further minimize $I(X_i; T_i)$, we observe that the benign F1 score starts to drop, indicating that the increasingly compressed representation could start to hurt its predictive capability.
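The grid search described above can be sketched as follows. This is a generic illustration rather than the procedure used in the paper: the grid spacing, the selection rule (highest robust score, ties broken by benign score) and the `evaluate` callback, which would stand in for a full fine-tuning and evaluation run, are all assumptions introduced for the example.

```python
import math

def log_spaced_grid(low, high, num):
    """Return `num` values geometrically spaced between low and high (inclusive)."""
    if num < 2 or low <= 0 or high <= low:
        raise ValueError("need num >= 2 and 0 < low < high")
    step = (math.log(high) - math.log(low)) / (num - 1)
    return [math.exp(math.log(low) + i * step) for i in range(num)]

def grid_search_beta(betas, evaluate):
    """Pick the beta with the best robust score (ties broken by benign score).

    evaluate: hypothetical callback mapping beta -> (benign_score, robust_score),
    e.g. by fine-tuning a model with that beta and scoring it on dev sets.
    Returns the chosen beta and the full table of results.
    """
    results = {beta: evaluate(beta) for beta in betas}
    best = max(results, key=lambda b: (results[b][1], results[b][0]))
    return best, results
```

Searching on a geometric grid matches the log-scale axis implied by the text, where $\beta = 5 \times 10^{-5}$ corresponds to a natural logarithm of about $-9.9$.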

A.2.2 ABLATION STUDY ON ANCHORED FEATURE REGULARIZER

Visualization of Anchored Words. To explore which local anchored features are extracted, we conduct another ablation study to visualize the local anchored words. We follow the best hyper-parameters of the Anchored Feature Regularizer introduced in §A.1, use the best BERT model trained on benign datasets (MNLI + SNLI) only, and test on the ANLI dev set. We visualize the local anchored words in Table 7. In the first example, we find that the anchored features mainly focus on important features such as the quantity "Two", the verb "playing" and the objects "card"/"poker" to make a robust prediction. In the second example, the matching robust features between hypothesis and premise, such as "people", "roller" vs. "park" and "flipped upside" vs. "ride", are aligned to infer the relationship between hypothesis and premise.

A.2.3 ABLATION STUDY ON DISENTANGLING TWO REGULARIZERS

To understand how the two regularizers contribute to the improvement of robustness separately, we apply them individually to both standard training and adversarial training. We refer to InfoBERT trained with the IB regularizer only as "InfoBERT (IBR only)" and InfoBERT trained with the Anchored Feature Regularizer only as "InfoBERT (AFR only)". "InfoBERT (Both)" is the standard setting for InfoBERT, where we incorporate both regularizers during training. For "InfoBERT (IBR only)", we set α = 0 and perform grid search to find the optimal $\beta = 5 \times 10^{-2}$. Similarly, for "InfoBERT (AFR only)", we set β = 0 and find the optimal parameters to be $\alpha = 5 \times 10^{-3}$, $c_h = 0.9$, and $c_l = 0.5$. The results are shown in Table 8. We can see that both regularizers improve the robust accuracy on top of vanilla and FreeLB training by a similar margin. Applying just one of the regularizers achieves performance similar to that of FreeLB, while the training time of InfoBERT is 1/3 to 1/2 less than that of FreeLB. Moreover, after combining both regularizers, we observe that InfoBERT achieves the best robust accuracy.
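The way α and β switch the two regularizers on and off in this ablation can be sketched as below. The function name `infobert_loss`, the argument names, and the exact form of the combination (task loss, plus β times the summed IB terms, minus α times the summed anchored-feature terms) are assumptions made for illustration; any constant scaling of the terms is folded into α and β here.

```python
from typing import Dict, Sequence

# Hyper-parameters quoted in the ablation; the "Both" setting is omitted
# because its values are not stated in this section.
ABLATION_SETTINGS: Dict[str, Dict[str, float]] = {
    "InfoBERT (IBR only)": {"alpha": 0.0, "beta": 5e-2},
    "InfoBERT (AFR only)": {"alpha": 5e-3, "beta": 0.0},
}


def infobert_loss(
    task_loss: float,
    ib_terms: Sequence[float],
    afr_terms: Sequence[float],
    alpha: float,
    beta: float,
) -> float:
    """Combine the task loss with the two MI-based regularizers.

    ib_terms:  estimates of I(X_i; T_i), which training tries to decrease.
    afr_terms: estimates of I(T_kj; Z), which training tries to increase.
    The sign convention (a loss to be minimized) is an assumption of this sketch.
    """
    return task_loss + beta * sum(ib_terms) - alpha * sum(afr_terms)


ibr_only = infobert_loss(1.0, [0.2, 0.3], [0.4], **ABLATION_SETTINGS["InfoBERT (IBR only)"])
afr_only = infobert_loss(1.0, [0.2, 0.3], [0.4], **ABLATION_SETTINGS["InfoBERT (AFR only)"])
```

Setting α = 0 makes the loss independent of the anchored-feature estimates, and setting β = 0 makes it independent of the IB estimates, which is what allows the two contributions to be measured separately.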

A.2.4 EXAMPLES OF ADVERSARIAL DATASETS GENERATED BY TEXTFOOLER

We show some adversarial examples generated by TextFooler in Table 9. We can see that most adversarial examples are of high quality and look valid to humans while still attacking the NLP models, thus confirming that our adversarial datasets created by TextFooler are a strong benchmark for evaluating model robustness. However, as also noted in Jin et al. (2020), we observe that some adversarial examples look invalid to humans. For example, in the last example of Table 9, TextFooler replaces "stand" with "position", losing the critical information that "girls are standing instead of kneeling" and fooling both humans and NLP models. Therefore, we expect that InfoBERT should achieve better robustness when we eliminate such invalid adversarial examples during evaluation.

It is easy to verify that $\phi(x)$ is a continuous, monotonically increasing, concave, and subadditive function. Now, we can proceed with the proof of Theorem 3.2.
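The stated properties of φ, and the lemma $|a \log(a) - b \log(b)| \le \phi(|a - b|)$ that relies on it, can be checked numerically. The sketch below assumes a specific piecewise form for φ (zero at 0, $x \log(1/x)$ below $1/e$, and the constant $1/e$ from there on); this form is an assumption of the example, chosen because it has the properties listed above, and the helper names are hypothetical.

```python
import math
from typing import Dict

INV_E = 1.0 / math.e


def phi(x: float) -> float:
    """Assumed piecewise form: 0 at 0, x*log(1/x) on (0, 1/e), 1/e afterwards."""
    if x < 0:
        raise ValueError("phi is defined on non-negative reals")
    if x == 0.0:
        return 0.0
    if x < INV_E:
        return x * math.log(1.0 / x)
    return INV_E


def xlogx(x: float) -> float:
    # Convention 0 * log(0) = 0, as usual for entropy terms.
    return 0.0 if x == 0.0 else x * math.log(x)


def check_properties(n: int = 200, tol: float = 1e-12) -> Dict[str, bool]:
    """Check monotonicity, subadditivity, and the lemma on a grid over [0, 1]."""
    grid = [i / n for i in range(n + 1)]
    monotone = all(phi(grid[i]) <= phi(grid[i + 1]) + tol for i in range(n))
    subadditive = all(
        phi(x + y) <= phi(x) + phi(y) + tol for x in grid for y in grid
    )
    lemma = all(
        abs(xlogx(a) - xlogx(b)) <= phi(abs(a - b)) + tol
        for a in grid
        for b in grid
    )
    return {"monotone": monotone, "subadditive": subadditive, "lemma": lemma}


results = check_properties()
```

A grid check is not a proof, but it is a quick sanity test that the assumed form of φ is compatible with the properties the argument uses.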



We use the huggingface implementation https://github.com/huggingface/transformers for BERT and RoBERTa. We follow the FreeLB implementations in https://github.com/zhuchen03/FreeLB.



where $f_\theta$ is a deterministic feature extractor. The performance gap between benign and adversarial data, $|I(Y; T) - I(Y; T')|$, is bounded above by

Adding adversarial data to the training set can significantly improve model robustness. To find out what helps improve the robustness from the MI perspective, we first calculate the MI between anchored features and global features, $\frac{1}{M} \sum_{j=1}^{M} I(T_{k_j}; Z)$, on the adversarial test data and the benign test data, based on the model trained without adversarial training data (denoted by $I'_R$ and $I_R$). We then calculate the MI between nonrobust/unuseful features and global features, $\frac{1}{M} \sum_{i=1}^{M} I(T_{k_i}; Z)$, on the adversarial test data and the benign test data as well (denoted by $I'_N$ and $I_N$). After adding adversarial examples into the training set and re-training the model, we find that the MI between the local features and the global features substantially increases on the adversarial test data, which accounts for the robustness improvement.

Figure 2: The complete objective function of InfoBERT, which can be decomposed into (a) the standard task objective, (b) the Information Bottleneck Regularizer, and (c) the Local Anchored Feature Regularizer. For (b), we both theoretically and empirically demonstrate that we can improve the adversarial robustness by decreasing the mutual information $I(X_i; T_i)$ without affecting the benign accuracy much. For (c), we propose to align the local anchored features $T_{k_j}$ (highlighted in yellow) with the global feature $Z$ by maximizing their mutual information $I(T_{k_j}; Z)$.

Figure 3: Benign/robust F1 score on benign/adversarial QA datasets. Models are trained on the benign SQuAD dataset with different β.

For any $a, b \in [0, 1]$,

$$|a \log(a) - b \log(b)| \le \phi(|a - b|), \quad (23)$$

where $\phi(\cdot) : \mathbb{R}_+ \to \mathbb{R}_+$ is defined as

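We use the fact that

$$|I(Y; T) - I(Y; T')| \le |H(T \mid Y) - H(T' \mid Y)| + |H(T) - H(T')| \quad (25)$$

and bound each of the summands on the right separately.

We can bound the first summand as follows:

$$
\begin{aligned}
|H(T \mid Y) - H(T' \mid Y)| &\le \sum_y p(y)\, |H(T \mid Y = y) - H(T' \mid Y = y)| \\
&= \sum_y p(y)\, \Big| \sum_t p(t \mid y) \log p(t \mid y) - q(t \mid y) \log q(t \mid y) \Big| \\
&\le \sum_y p(y) \sum_t \phi\big(|p(t \mid y) - q(t \mid y)|\big) \\
&= \sum_y p(y) \sum_t \phi\Big(\Big| \sum_x p(t \mid x)\,[p(x \mid y) - q(x \mid y)] \Big|\Big), \quad (30)
\end{aligned}
$$

where

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\sum_x p(y \mid x)\, p(x)}, \quad (31)$$

$$q(x \mid y) = \frac{p(y \mid x)\, q(x)}{\sum_x p(y \mid x)\, q(x)}. \quad (32)$$

Since $\sum_{x \in \mathcal{X} \cup \mathcal{X}'} p(x \mid y) - q(x \mid y) = 0$ for any $y \in \mathcal{Y}$, we have that for any scalar $a$,

$$
\begin{aligned}
\Big| \sum_x p(t \mid x)\,[p(x \mid y) - q(x \mid y)] \Big| &= \Big| \sum_x (p(t \mid x) - a)(p(x \mid y) - q(x \mid y)) \Big| \\
&\le \sqrt{\sum_x (p(t \mid x) - a)^2}\, \sqrt{\sum_x (p(x \mid y) - q(x \mid y))^2}. \quad (35)
\end{aligned}
$$

Setting $a = \frac{1}{|\mathcal{X} \cup \mathcal{X}'|} \sum_{x \in \mathcal{X} \cup \mathcal{X}'} p(t \mid x)$, we get

$$|H(T \mid Y) - H(T' \mid Y)| \le \sum_y p(y) \sum_t \phi\Big(\sqrt{V(p(t \mid x \in \mathcal{X} \cup \mathcal{X}'))} \cdot \|p(x \mid y) - q(x \mid y)\|_2\Big),$$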

Adversarial Datasets The following adversarial datasets and adversarial attacks are used to evaluate the robustness of InfoBERT and the baselines. (I) Adversarial NLI (ANLI) (Nie et al., 2020) is a large-scale NLI benchmark, collected via an iterative, adversarial, human-and-model-in-the-loop procedure to attack BERT and RoBERTa. ANLI is a strong adversarial dataset that can easily reduce the accuracy of BERT Large to 0%. (II) The Adversarial SQuAD dataset (Jia & Liang, 2017) is an adversarial QA benchmark generated by a set of handcrafted rules and refined by crowdsourcing. Since adversarial training data is not provided, we fine-tune RoBERTa Large on the benign SQuAD training data (Rajpurkar et al., 2016) only, and test the models on both the benign and adversarial test sets. (III) TextFooler (Jin et al., 2020) is the state-of-the-art word-level adversarial attack method for generating adversarial examples. To create an adversarial evaluation dataset, we sample 1,000 examples from each of the SNLI and MNLI test sets, and run TextFooler against BERT Large and RoBERTa Large to obtain the adversarial text examples.

Robust accuracy on the ANLI dataset. Models are trained on the benign datasets (MNLI + SNLI) only. 'A1-A3' refers to the rounds with increasing difficulty. 'ANLI' refers to A1+A2+A3.

Results of the first setting are summarized in Table 1. The vanilla RoBERTa and BERT models perform poorly on the adversarial dataset. In particular, vanilla BERT Large with standard training achieves the lowest robust accuracy of 26.5% among all the models. We also evaluate the robustness improvement from performing adversarial training during fine-tuning, and observe that adversarial training for language models can improve not only generalization but also robustness.

Robust accuracy on the adversarial SNLI and MNLI(-m/mm) datasets generated by TextFooler based on blackbox BERT/RoBERTa (denoted in brackets of the header). Models are trained on the benign datasets (MNLI+SNLI) only.

Robust F1/EM scores based on RoBERTa Large on the adversarial SQuAD datasets (AddSent and AddOneSent). Models are trained on the standard SQuAD 1.0 dataset.

Evaluation on Adversarial SQuAD Previous experiments show that InfoBERT can improve model robustness for NLI tasks. Now we demonstrate in Table 4 that InfoBERT can also be adapted to other NLP tasks such as QA. Similar to our observation on the NLI datasets, we find that InfoBERT barely hurts the performance on the benign test data, and even improves it in some cases. Moreover, InfoBERT substantially improves model robustness when presented with adversarial QA test sets (AddSent and AddOneSent). While adversarial training does help improve robustness, InfoBERT can further boost the robust performance by a larger margin. In particular, InfoBERT through standard training achieves a state-of-the-art robust F1/EM score of 78.5/72.9 compared to existing adversarial training baselines, while requiring only half the training time of adversarial-training-based methods.
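For readers unfamiliar with the F1/EM metrics reported above, the sketch below shows how exact match and token-level F1 are commonly computed for extractive QA. It assumes the usual SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) and a single reference answer per question; the function names are illustrative and this is not the evaluation script used for the reported numbers.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em_example = exact_match("The Eiffel Tower", "eiffel tower.")
f1_example = f1_score("in Paris France", "Paris")
```

When a question has several reference answers, the per-question score is typically the maximum over references, and the dataset-level score is the average over questions.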

The neural MI estimator used by InfoNCE consists of two fully connected layers, with the hidden size of the intermediate layer set to 300. As discussed in §A.1, we have two ways to model $p(t \mid x)$: (i) using an auxiliary neural network to approximate the distribution; (ii) directly using the BERT encoder $f_\theta$ to calculate $p(t \mid x)$. We implement both methods and compare the robustness improvement in Table 5. To eliminate other factors such as the Anchored Feature Regularizer and adversarial training, we set α = 0 and $\beta = 5 \times 10^{-2}$, and conduct the following ablation experiments via standard training on standard datasets. We observe that although both modeling methods can improve model robustness, modeling with the BERT encoder gives a larger margin than the auxiliary network. Moreover, the second way barely sacrifices performance on benign data, while the first way can slightly hurt the benign accuracy. Therefore, we use the BERT encoder $f_\theta$ to model $p(t \mid x)$ in our main paper.

Robust accuracy on the ANLI dataset. Here "Standard Datasets" refers to training on the benign datasets (MNLI + SNLI) only. "Vanilla" refers to the vanilla BERT trained without the Information Bottleneck Regularizer.
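The InfoNCE lower bound that a critic network such as the two-layer estimator mentioned above feeds into can be sketched in a few lines. In this illustration the critic is abstracted away: the function takes a precomputed square matrix of critic scores, where entry (i, j) scores the i-th local feature against the j-th global feature and the diagonal holds the positive pairs. The function name and the score-matrix interface are assumptions of the sketch.

```python
import math
from typing import Sequence


def info_nce_bound(scores: Sequence[Sequence[float]]) -> float:
    """InfoNCE estimate from an N x N matrix of critic scores.

    Computes mean_i [ s_ii - log( (1/N) * sum_j exp(s_ij) ) ], which is a
    lower bound on the mutual information and can never exceed log(N).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one pair")
    total = 0.0
    for i, row in enumerate(scores):
        if len(row) != n:
            raise ValueError("score matrix must be square")
        m = max(row)  # subtract the row maximum for numerical stability
        log_mean_exp = m + math.log(sum(math.exp(s - m) for s in row) / n)
        total += row[i] - log_mean_exp
    return total / n


# Uninformative critic: every pair gets the same score, so the bound is 0.
uniform_bound = info_nce_bound([[0.0, 0.0], [0.0, 0.0]])

# Confident critic: positives score far above negatives, bound approaches log(N).
sharp_scores = [[50.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
sharp_bound = info_nce_bound(sharp_scores)
```

Because the estimate saturates at log(N), the number of negatives per positive pair limits how much mutual information the bound can certify.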

These anchored feature examples confirm that Anchored Feature Regularizer is able to find out useful and stable features to improve the robustness of global representation.

Local anchored features extracted by Anchored Feature Regularizer.

6. ACKNOWLEDGEMENT

We gratefully thank the anonymous reviewers and meta-reviewers for their constructive feedback. We also thank Julia Hockenmaier, Alexander Schwing, Sanmi Koyejo, Fan Wu, Wei Wang, Pengyu Cheng, and many others for the helpful discussion. This work is partially supported by NSF grant No.1910100, DARPA QED-RML-FP-003, and the Intel RSA 2020.


Input (red = modified words, bold = original words)

Valid Adversarial Examples

Premise: A young boy is playing in the sandy water.
Original Hypothesis: There is a boy in the water.
Adversarial Hypothesis: There is a man in the water.
Model Prediction: Entailment → Contradiction

Premise: A black and brown dog is playing with a brown and white dog.
Original Hypothesis: Two dogs play.
Adversarial Hypothesis: Two dogs gaming.
Model Prediction: Entailment → Neutral

Premise: Adults and children share in the looking at something, and some young ladies stand to the side.
Original Hypothesis: Some children are sleeping.
Adversarial Hypothesis: Some children are dreaming.
Model Prediction: Contradiction → Neutral

Premise: Families with strollers waiting in front of a carousel.
Original Hypothesis: Families have some dogs in front of a carousel.
Adversarial Hypothesis: Families have some doggie in front of a carousel.

Invalid Adversarial Examples

Premise: Two girls are kneeling on the ground.
Original Hypothesis: Two girls stand around the vending machines.
Adversarial Hypothesis: Two girls position around the vending machinery.
Model Prediction: Contradiction → Neutral

We first state two lemmas.

Lemma A.1. Given a sequence of random variables $X_1, X_2, ..., X_n$ and a deterministic function $f$, then $\forall\, i, j = 1, 2, ..., n$, we have $I(X_i; f(X_i)) \ge I(X_j; f(X_i))$.

Proof. By the definition,

Since $f$ is a deterministic function,

Lemma A.2. Let $X = [X_1; X_2; ...; X_n]$ be a sequence of random variables, and $T = [T_1; T_2; ...; T_n]$ be a sequence of random variables generated by a deterministic function $f$. Then we have

Proof. Since $X = [X_1; X_2; ...; X_n]$ and $T = [T_1; T_2; ...; T_n]$ are language tokens with their corresponding local representations, we have

where the first inequality follows because conditioning reduces entropy, and the last inequality is because $I(X_i; T_i) \ge I(X_j; T_i)$ based on Lemma A.1.

Then we directly plug Lemma A.2 into Theorem 3.1, and we have the lower bound of $L_{IB}$ as

A.3.2 PROOF OF THEOREM 3.2

We first state an easily proven lemma, where for any real-valued vector $a = (a_1, ..., a_n)$, $V(a)$ is defined to be proportional to the variance of the elements of $a$:

$p(t \mid x \in \mathcal{X} \cup \mathcal{X}')$ stands for the vector in which entries are $p(t \mid x)$ with different values of $x \in \mathcal{X} \cup \mathcal{X}'$ for a fixed $t$, and $p(x \mid y)$ and $q(x \mid y)$ are the vectors in which entries are $p(x \mid y)$ and $q(x \mid y)$, respectively, with different values of $x \in \mathcal{X} \cup \mathcal{X}'$ for a fixed $y$.

Since

it follows that

Moreover, we have

where the first inequality is because the sample mean is the minimizer of the sum of the squared distances to each sample, and the second inequality is due to the subadditivity of the square root function.
Using the fact that $\phi(\cdot)$ is monotonically increasing and subadditive, we get

Now we explicate the process for establishing the bound for $\sum_y p(y) \cdots$, and the one for $\sum_y p(y) \sum_t \phi\big(2\sqrt{V(p(t \mid x \in \mathcal{X}))}\big)$ can be similarly derived.

By the definition of $V(\cdot)$ and using Bayes' theorem $p(t \mid x) = \frac{p(t)\, p(x \mid t)}{p(x)}$ for $x \in \mathcal{X}$, we have that

Denoting $\mathbf{1} = (1, ..., 1)$, we have by the triangle inequality that

From an inequality linking KL-divergence and the $l_1$ norm, we have that

Plugging Eq. (52) into Eq. (51) and using Eq. (43), we have the following bound:

where $B =$

We will first proceed with the proof under the assumption that $B p(t) \sqrt{d_t} \le \frac{1}{e}$ for any $t$. We will later see that this condition can be discarded. If $B p(t)$

where the last inequality is due to an easily proven fact that for any $x > 0$, $x \log(\frac{1}{x}) \le \sqrt{x}$. We

$p(t)$ and $d(t)$ are vectors comprising $p(t)$ and $d_t$ with different values of $t$, respectively.

Using the following two inequalities:

and

Using the equality

we reach the following bound

Plugging Lemma A.2 into the equation above, we have

We now show that the bound is trivial if the assumption that $B p(t) \sqrt{d_t} \le \frac{1}{e}$ does not hold. If the assumption does not hold, then there exists a $t$ such that $B p(t)$

for any $t$, we get that $I(X; T) \ge \frac{1}{eB}$. Since $|\mathcal{T}| \ge 1$ and $C \ge 0$, we get that our bound in Eq. (63) is at least

It can be verified that $f(c) > 0$ if $c > 0$. Since $B > 4\sqrt{2 \log(2)}$ by the definition of $B$, we have $f(B) > f(4\sqrt{2 \log(2)}) > 0.746$. Therefore, we have

Therefore, if indeed $B p(t) \sqrt{d_t} > \frac{1}{e}$ for some $t$, then the bound in Eq. (63) is trivially true, since $H(T \mid Y)$ is within $[0, \log(|\mathcal{T}|)]$. Similarly, we can establish a bound for

where

$\min_{x \in \mathcal{X}} q(x)$.

Plugging Eq. (73) and Eq. (65) into Eq. (42), we get

(74)

Now we turn to the third summand in Eq. (25): we have to bound $|H(T) - H(T')|$.

Recall the definition of the $\epsilon$-bounded adversarial example. We denote the set of the benign data representations $t$ that are within the $\epsilon$-ball of $t'$ by $Q(t')$. Then for any $t \in Q(t')$, we have

for $i = 1, 2, ..., n$.
We also denote the number of the $\epsilon$-bounded adversarial examples around the benign representation $t$ by $c(t)$. Then we have the distribution of the adversarial representation $t'$ as follows:

Combining Eq. (25), Eq. (74), and Eq. (85), we prove the bound in Theorem 3.2.
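Three elementary facts invoked in the proof above (the sample mean minimizes the sum of squared distances, the square root is subadditive, and $x \log(1/x) \le \sqrt{x}$ for $x > 0$) can be sanity-checked numerically. The sketch below is only an illustration on arbitrarily chosen values; the sample vector, the grids, and the helper names are assumptions of the example.

```python
import math
from typing import Sequence


def sum_sq_dist(values: Sequence[float], center: float) -> float:
    """Sum of squared distances from each value to a candidate center."""
    return sum((v - center) ** 2 for v in values)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Fact 1: the sample mean minimizes the sum of squared distances.
values = [0.05, 0.2, 0.35, 0.4]  # arbitrary probabilities in [0, 1]
candidates = [i / 100 for i in range(101)]
at_mean = sum_sq_dist(values, mean(values))
mean_is_minimizer = all(
    at_mean <= sum_sq_dist(values, a) + 1e-15 for a in candidates
)

# Fact 2: sqrt(u + v) <= sqrt(u) + sqrt(v) for non-negative u, v.
sqrt_subadditive = all(
    math.sqrt(u + v) <= math.sqrt(u) + math.sqrt(v) + 1e-15
    for u in candidates
    for v in candidates
)

# Fact 3: x * log(1/x) <= sqrt(x) for x > 0, checked on a log-spaced grid.
xs = [10 ** (k / 10) for k in range(-60, 31)]
entropy_term_bounded = all(x * math.log(1.0 / x) <= math.sqrt(x) for x in xs)
```

The third fact also follows analytically: substituting $u = 1/\sqrt{x}$ turns it into $2 \log(u) \le u$, and $u - 2 \log(u)$ attains its minimum $2 - 2 \log(2) > 0$ at $u = 2$.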

