ROBUSTNESS EXPLORATION OF SEMANTIC INFORMATION IN ADVERSARIAL TRAINING

Abstract

In this paper, we look into the problem of adversarial robustness from the perspective of semantic information. We present a novel insight: adversarial attacks destroy the correlation between visual representations and semantic word vectors, while adversarial training restores it. We further find that the correlations among robust features of different categories are consistent with the correlations among the corresponding semantic word vectors. Based on these findings, we introduce semantic information to assist model training and propose Semantic Constraint Adversarial Robust Learning (SCARL). First, through an information-theoretic lens, we formulate the mutual information between the visual representation and the corresponding semantic word vector in the embedding space to bridge the information gap, and we provide a differentiable lower bound to optimize this mutual information efficiently. Second, we propose a novel semantic structural constraint that encourages the trained model to keep the structure of the visual representations consistent with that of the semantic word vectors. Finally, we combine these two techniques with adversarial training to learn robust visual representations. Extensive experiments on several benchmarks demonstrate that semantic information is indeed beneficial to model robustness.

1. INTRODUCTION

Word embedding is one of the critical technologies in natural language processing (Pennington et al., 2014; Goldberg & Levy, 2014; Tang et al., 2014). It uses co-occurrence statistics between pairs of words within a given context in a large-scale training corpus to learn an encoder that can infer a vector for any word in a learned embedding space. A well-trained word embedding model is usually regarded as a knowledge graph (Matthews & Matthews, 2001; Wang et al., 2018), in which the meaning of a word is determined by its relationship to other words in the learned vector space. That is, analogies and correlations between words are captured by the learned vectors (Hohman et al., 2018; Chersoni et al., 2021), which help a model associate seen objects with unseen ones. Recently, several works have explored using semantic word/text embeddings as supervision signals for zero-shot learning and visual-linguistic pre-training, achieving impressive successes in various AI tasks (Qiao et al., 2017; Wang et al., 2018; Radford et al., 2021; Wang et al., 2022). On the other hand, deep neural networks are usually vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Madry et al., 2018; Bhojanapalli et al., 2021), which severely limits their applications in many security-sensitive scenarios. Fortunately, some studies (Radford et al., 2021; Yu et al., 2022) have shown that visual models trained with semantic supervision are much more robust to distribution shift and adversarial examples than standard trained models. These preliminaries raise a natural question: what is the impact of semantic information on adversarial robustness? To answer this question, we explore the relationship between semantic information and model robustness from two aspects: distributional and structural relevance.
Firstly, we apply canonical correlation analysis (CCA) (Hotelling, 1992), which reflects the connection between two random variables, to analyze the distributional relevance between the visual representation and the corresponding semantic word vector. We mainly analyze the correlation coefficient between the representations of natural and adversarial images and the semantic word vector under non-robust and robust models (Madry et al., 2018). The results in Figure 1 indicate that, for the non-robust model, the representation of the natural image has a high correlation with its corresponding word vector, whereas the adversarial image has a lower correlation. This means that an adversarial attack destroys the semantic information in a non-robust model, which echoes the previous observations in (Zhang & Zhu, 2019; Ilyas et al., 2019) from a new perspective. For the robust model trained on adversarial examples (Madry et al., 2018), the correlation between the visual representation and the word vector is significantly stronger. As a result, we can summarize a novel intriguing property: the more robust the model, the stronger the correlation. Secondly, to verify that semantic word vectors capture the analogies and correlations between words, we visualize the similarity matrix of word vectors generated by a trained GloVe model (Pennington et al., 2014) on CIFAR-10, shown in Figure 2 (c). As can be seen from the figure, the correlation between category 3 (Cat) and category 5 (Dog) is stronger than the correlation between category 3 (Cat) and category 9 (Truck). We further visualize the similarity between different categories of non-robust features and the similarity of robust features, shown in Figure 2 (a) and (b) respectively. We observe that robust features can also reflect the relatedness between categories, and this relatedness is similar to that reflected by the semantic word vectors.
However, the non-robust features cannot reflect the association between categories. Recently, CLIP (Radford et al., 2021) used large-scale image-text pairs to jointly learn semantic representations, so we also visualize the correlation matrix of the semantic representations learned by CLIP, shown in Figure 2 (d). The semantic representations learned by CLIP present analogies and correlations between categories, but there is still a gap with the real semantics. Based on this analysis, we introduce the semantic information learned by word embeddings into model training, aiming to improve the robustness of current neural networks (He et al., 2016a). We follow an information-theoretic perspective to bridge the information gap between visual representations and semantic word vectors, which consists of two key techniques. First, we use mutual information to enhance semantic information in the visual representation, strengthening the correlations via distributional information. Second, we introduce geometric constraints to align the manifold information from the visual representation space with the word vector space, strengthening the correlations via structural information. Finally, we propose the Semantic Constraint Adversarial Robust Learning (SCARL) framework, which combines the above two techniques with adversarial training. Our contributions are summarized as follows:
• We are the first to explore the correlation between semantic word information and deep models via the classical CCA method. We find that the more robust the model, the stronger the correlation between visual representations and semantic word vectors.
• We analyze the correlations between different categories of image features and find that robust features reflect the semantic associations between categories, consistent with the word vectors, whereas non-robust features do not.
• We introduce the Semantic Constraint Adversarial Robust Learning (SCARL) framework, which captures the distributional and structural information of semantic word vectors via mutual information optimization and geometric constraints to promote robustness.
• We conduct extensive experiments on three widely-used benchmarks. The results show that the proposed SCARL is more robust than several state-of-the-art techniques, demonstrating that semantic information indeed helps improve robustness.
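The CCA analysis above can be sketched in a few lines. The following is a minimal numpy implementation of the first canonical correlation between paired visual representations and word vectors; the function name and the SVD-based whitening route are our own illustrative choices, not the paper's released code.

```python
import numpy as np
from numpy.linalg import svd

def cca_correlation(feats, word_vecs):
    """First canonical correlation between two paired views (rows are samples).

    Canonical correlations are the singular values of Qx^T Qy, where Qx and Qy
    are orthonormal bases of the centered column spaces of the two views.
    """
    X = feats - feats.mean(axis=0)       # center each view
    Y = word_vecs - word_vecs.mean(axis=0)
    Ux, _, _ = svd(X, full_matrices=False)  # orthonormal basis of view 1
    Uy, _, _ = svd(Y, full_matrices=False)  # orthonormal basis of view 2
    corr = svd(Ux.T @ Uy, compute_uv=False)  # singular values, descending
    return corr[0]                           # top canonical correlation in [0, 1]
```

Applied to a robust model's features, this statistic is what produces the "more robust, stronger correlation" trend reported above.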

2.1. PROBLEM SETTING AND NOTATIONS

The robustness against adversarial example attacks has attracted a lot of attention, and various training algorithms have been proposed to mitigate the problem (Madry et al., 2018; Dhillon et al., 2018; Yang et al., 2019; Song et al., 2019; Wu et al., 2020). However, none of these methods survives the attacks of Athalye et al. (2018), which circumvent gradient masking. In this competition between attackers and defenders, adversarial training (Madry et al., 2018; Zhang et al., 2019) stands out as a promising solution to defend against the strongest adversarial attacks. The approach augments the training data with adversarial inputs produced by an adversarial attack, which can be formulated as a min-max optimization problem: $\min_{\theta} \mathbb{E}\left[\max_{x_{adv}} \mathcal{L}\left(F_{\theta}(x_{adv}), y\right)\right]$, where $F_{\theta}$ is a DNN model with parameters $\theta$, $\mathcal{L}$ is the loss function of the DNN, and $x_{adv}$ denotes an adversarial example. On the other hand, several works have explored using semantic word embeddings as supervision to improve model performance (Gan et al., 2020; Radford et al., 2021). However, most of them are pre-trained on large-scale image-text datasets and then transferred to downstream tasks. Such approaches have several critical limitations: 1) they require large-scale image-text datasets, which are collected from the web and often noisy; 2) the representation spaces of visual images and semantic words are inconsistent; and 3) there is no reliable robustness evaluation method for multimodal information. To overcome these limitations, we first focus on the essential image classification task and use the CIFAR (Krizhevsky et al., 2009) and TinyImageNet (Deng et al., 2009) datasets, which provide one-hot labels and the corresponding semantic words. Second, for the semantic word vectors, we use GloVe (Pennington et al., 2014) or CLIP (Radford et al., 2021) to embed the names or descriptions of the target dataset's classes.
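The inner maximization of the min-max problem above is typically solved with projected gradient descent (PGD). A minimal numpy sketch follows, using a linear softmax classifier so the input gradient is analytic (a stand-in for autograd on a real DNN); the function name and the zero-initialization are illustrative assumptions.

```python
import numpy as np

def pgd_attack(x, y, W, b, eps=8/255, alpha=2/255, steps=10):
    """Sketch of the PGD inner maximization max_{x_adv} L(F(x_adv), y)
    for a linear softmax classifier. A random start inside the eps-ball
    is common in practice; we start at x for determinism."""
    x_adv = x.copy()
    for _ in range(steps):
        logits = x_adv @ W + b
        p = np.exp(logits - logits.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        p[np.arange(len(y)), y] -= 1.0            # dL/dlogits for cross-entropy
        grad = p @ W.T                            # dL/dx_adv (chain rule)
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid pixel range
    return x_adv
```

Adversarial training then simply feeds `pgd_attack`'s output back into the outer minimization over θ.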
Then, we project the visual representation and the semantic word vector into a consistent manifold, and introduce two constraint strategies, from the information and the structural perspective, to ensure the visual representation manifold is consistent with that of the semantic word vector. Lastly, we combine the proposed semantic constraint techniques with adversarial training to learn robust representations, which we call Semantic Constraint Adversarial Robust Learning (SCARL). The overall framework is shown in Figure 3.

Figure 3: The framework of our SCARL. The visual representation and the semantic word vector are extracted by a DNN and a word embedding model, respectively. We maximize the mutual information between the visual representation and the corresponding semantic word vector via Lemma 2.1, and maximize the mutual information between the visual representation and the one-hot label by minimizing the cross-entropy loss. Through geometric structure constraints, we further align the manifold information from the visual representation space to the word vector space. Best viewed in color.

Notations. We define random variables $X, Y, Z \sim p_D(x, y, z)$, where $X$ represents the image input, $Y$ the one-hot class label, and $Z$ the corresponding semantic words; $p_D$ is the data distribution, and $x, y, z$ are the observed values. For the image classification task, our goal is to build a classifier $q(y \mid F_{\theta}(x))$, where $F_{\theta}$ is our objective model. We define the random variable $S = F_{\theta}(X)$ as the visual representation of $X$ extracted by $F_{\theta}$, and the random variable $T = E_{\theta}(Z)$ as the word vector of $Z$ encoded by the word embedding model $E_{\theta}$. Our goal is to train $F_{\theta}$ such that $S$ is capable of predicting $Y$.

2.2. MAXIMIZING LOWER BOUND ON SEMANTIC MUTUAL INFORMATION

In the classification task, the classifier is often trained with the cross-entropy loss, which can be viewed as maximizing the mutual information $I(S; Y) = \mathbb{E}\left[\log \frac{p(S, Y)}{p(S)\,p(Y)}\right]$, where $p(S, Y)$ denotes the joint probability distribution of $S$ and $Y$, and $p(S)$ and $p(Y)$ are the marginals. Following variational inference, we can use $q(y \mid s)$ as a variational approximation of $p(y \mid s)$ and derive a variational lower bound on $I(S; Y)$ as follows:

$I(S; Y) = H(Y) - \mathbb{E}_{s, y \sim p_D}[-\log q(y \mid s)] + \mathbb{E}_{s}\left[\mathrm{KL}\left(p(\cdot \mid s) \,\|\, q(\cdot \mid s)\right)\right] \ge H(Y) - \mathbb{E}_{s, y \sim p_D}[-\log q(y \mid s)]$,

where $H(Y)$ is a constant, the Shannon entropy of $Y$, and $\mathbb{E}_{s, y}[-\log q(y \mid s)]$ is essentially the cross-entropy loss using $q(y \mid s)$ for classification. Therefore, maximizing $I(S; Y)$ can be achieved by minimizing $\mathbb{E}_{s, y}[-\log q(y \mid s)]$ instead. However, maximizing $I(S; Y)$ does not take into account the information in the semantic words. To bridge the visual representation and the semantic word vector, we follow an information-theoretic lens to examine the information gap between them, and propose maximizing the semantic mutual information during model training. We formulate our objective as:

$\max I(S; Y, T)$. (3)

Our goal is to train the model such that $S$ is capable of predicting $Y$ while also learning the semantic information in $T$. Since the objective in equation 3 is difficult to optimize directly, we decompose it into two terms:

$I(S; Y, T) = I(S; Y) + I(S; T \mid Y)$,

where $I(S; Y)$ measures how well the model predicts the one-hot label, and $I(S; T \mid Y)$ measures how well the visual representation learns semantic information from the semantic word vector. $I(S; Y)$ can be optimized using the cross-entropy loss. For $I(S; T \mid Y)$, inspired by (Hjelm et al., 2018; Tian et al., 2020), we use contrastive learning to derive a lower bound $-\mathcal{L}_{info}$ on this conditional mutual information.
To achieve a tractable objective, we introduce the following lemma.

Lemma 2.1. Given 1 congruent pair $\{x, z, y \mid C_Y = 1\}$ and $N$ incongruent pairs $\{x_i, z_i, y_i \mid C_Y = 0\}_{i=1}^{N}$ sampled i.i.d. from the distribution $p_D(x, y, z)$, with $s = F_{\theta}(x)$ and $t = E_{\theta}(z)$, $I(S; T \mid Y)$ is lower bounded by

$-\mathcal{L}_{info} = \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log \frac{e^{g_s(s)^{\top} g_t(t)/\tau}}{e^{g_s(s)^{\top} g_t(t)/\tau} + c}\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - \frac{e^{g_s(s)^{\top} g_t(t)/\tau}}{e^{g_s(s)^{\top} g_t(t)/\tau} + c}\right)\right]$,

where $c$ is the cardinality of the dataset and $\tau$ is a temperature that adjusts the concentration level. $g_s$ and $g_t$ are nonlinear projection heads that transform the representations into the same manifold space, followed by L2 normalization. The proof is given in the supplementary material.

By leveraging Lemma 2.1, $\mathcal{L}_{info}$ can be computed over a batch of samples and minimized in order to maximize $I(S; T \mid Y)$. We further analyze the characteristics of the semantic mutual information and its differentiable lower bound. In Lemma 2.1, the term $N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - \frac{e^{g_s(s)^{\top} g_t(t)/\tau}}{e^{g_s(s)^{\top} g_t(t)/\tau} + c}\right)\right]$ relaxes the bound. The final bound in Lemma 2.1 thus contains two parts: the first maximizes the mutual information between an image and the corresponding semantic words; the second minimizes the mutual information between an image and mismatched semantic words. Intuitively, the formulation of Lemma 2.1 resembles the fundamental goal of metric learning: learn a representation that is close in some metric space for "positive" pairs and pushed apart for "negative" pairs. Unlike traditional metric learning, our approach is grounded in information theory and can be seen as a special form of metric learning that optimizes mutual information.
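The bound in Lemma 2.1 can be estimated over a mini-batch. Below is a minimal numpy sketch of that Monte-Carlo estimate; the function name, the choice of `tau`, and the default `c` (the size of CIFAR-scale training sets) are illustrative assumptions, and the inputs are assumed to already be the L2-normalized projections $g_s(s)$ and $g_t(t)$.

```python
import numpy as np

def l_info(s_pos, t_pos, s_neg, t_neg, tau=0.1, c=50000.0):
    """Monte-Carlo estimate of L_info from Lemma 2.1.

    s_pos/t_pos are congruent (matched) pairs, s_neg/t_neg incongruent ones;
    rows are paired samples, each row an L2-normalized projection."""
    def h(s, t):  # posterior estimator h(S, T) = e^{s.t/tau} / (e^{s.t/tau} + c)
        e = np.exp(np.sum(s * t, axis=1) / tau)
        return e / (e + c)
    n_neg = len(s_neg)
    pos_term = np.mean(np.log(h(s_pos, t_pos)))            # matched pairs
    neg_term = n_neg * np.mean(np.log(1.0 - h(s_neg, t_neg)))  # mismatched pairs
    return -(pos_term + neg_term)                           # loss = -lower bound
```

Minimizing this quantity pushes matched image/word projections together and mismatched ones apart, as the lemma's two terms prescribe.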

2.3. SEMANTIC STRUCTURE CONSTRAINT

In the last subsection, we bridged the semantic word vector and the visual representation through mutual information, but ignored the structural relationships among linguistic words. Based on our findings in Figure 2, the correlation structure of robust features is similar to that of semantic word vectors. Another reason for using structural constraints is that when a non-robust model is attacked, the manifold of representations is distorted and the spatial structure is disrupted; thus, maintaining structural stability is beneficial to the defense. Therefore, we propose a semantic structure constraint loss to keep the structure of image representations consistent with that of semantic words. Given a dataset $D$ with $K$ classes, let $\mathcal{M} \subset \mathbb{R}^d$ be a lower-dimensional manifold, where $\{s, t \in \mathcal{M} \mid s = F_{\theta}(x), t = E_{\theta}(z), x, z \in D\}$. Here $s$ is the output of a chosen layer of the network (e.g., the hidden output before the logit layer). We define the visual representation centers as $S_{image} = \{s^1, \ldots, s^K \mid s^i \in \mathcal{M}\}$, a set of $K$ vertices representative of dataset $D$; each vertex $s^i$ is the centroid of the feature vectors of one class within a neighbourhood region. Similarly, we define the word vector centers $T_{word} = \{t^1, \ldots, t^K \mid t^i \in \mathcal{M}\}$. To obtain $T_{word}$, we first initialize each vertex at a random position in the manifold space, and then update $t^i$ iteratively using the momentum rule:

$t^i_{new} = t^i + m \cdot \left(E_{\theta}(z \mid y = i) - t^i\right), \quad i = 1, \ldots, K$, (6)

where $t^i_{new}$ denotes the updated vertex and the hyperparameter $m \in [0, 1)$ is the momentum coefficient. Equation 6 ensures the vertex takes stable steps towards the center as training goes on. For the visual representation centers, we construct a new center $s^i_{new}$ from the observed values of each vertex in each training epoch, estimated as the average representation of the same class within the mini-batch:

$s^i_{new} = \frac{1}{N}\sum_{n=1}^{N} F_{\theta}(x_n \mid y_n = i), \quad i = 1, \ldots, K.$
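The two center updates above reduce to a few lines of array code. The following numpy sketch implements the momentum rule of equation 6 and the per-class batch averaging; function names and the default momentum are illustrative assumptions.

```python
import numpy as np

def update_word_centers(T, class_embed, m=0.9):
    """Momentum update of equation 6: t_i <- t_i + m * (E_theta(z|y=i) - t_i).

    T and class_embed are (K, d): current centers and per-class word embeddings."""
    return T + m * (class_embed - T)

def batch_image_centers(feats, labels, num_classes):
    """Per-class mean of mini-batch representations: s_i_new = mean of class-i rows."""
    return np.stack([feats[labels == i].mean(axis=0) for i in range(num_classes)])
```

Repeated application of `update_word_centers` converges geometrically toward the class embeddings, which is the "stable step towards the center" behavior the text describes.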
Then, we let $T_{word}$ constrain $S_{image}$ to make them consistent. To this end, we propose two geometric relation matching metrics, distance-wise and angle-wise, both of which match the geometric structure between the visual representations and the semantic word vectors.

Distance-wise Matching. For the distance-wise metric, given a pair of representation centers $\langle s^i, s^j \rangle$ and a distance function $\phi_D$, we calculate the distance (e.g., Euclidean distance) between the two image centers in the representation space:

$\phi_D(s^i, s^j) = \frac{1}{\mu}\,\|s^i - s^j\|_2$,

where $\mu$ is a normalization factor for the distance. To focus on the relative distances among pairs, we set $\mu$ to the average distance over pairs from $S_{image}$:

$\mu = \frac{1}{|S|^2}\sum_{s^i, s^j \in S}\|s^i - s^j\|_2$.

Similarly, we calculate $\phi_D(t^i, t^j)$. Using the distance-wise potentials measured over both image and word centers, the distance-wise matching loss is defined as:

$\mathcal{L}_D = \sum_{s^i, s^j \in S,\; t^i, t^j \in T} l_{\delta}\big(\phi_D(s^i, s^j), \phi_D(t^i, t^j)\big)$,

where $l_{\delta}$ is the smooth L1 loss (Ren et al., 2015). The distance-wise loss matches the relationships among centers by penalizing the distance differences between their outputs in the manifold space.

Angle-wise Matching. For the angle-wise metric, given a triplet of centers $\langle s^i, s^j, s^k \rangle$, an angle-wise relational potential measures the angle formed by the visual representation and word vector centers in the output manifold space:

$\phi_A(s^i, s^j, s^k) = \cos\angle s^i s^j s^k = \langle \mathbf{e}^{ij}, \mathbf{e}^{kj} \rangle, \quad \mathbf{e}^{ij} = \frac{s^i - s^j}{\|s^i - s^j\|_2}, \; \mathbf{e}^{kj} = \frac{s^k - s^j}{\|s^k - s^j\|_2}$.

Using the angle-wise potentials measured on both the visual representation and word vector centers, the angle-wise matching loss is formulated as:

$\mathcal{L}_A = \sum_{s^i, s^j, s^k \in S,\; t^i, t^j, t^k \in T} l_{\delta}\big(\phi_A(s^i, s^j, s^k), \phi_A(t^i, t^j, t^k)\big)$.

The angle-wise loss matches the structure of the visual manifold with the semantic structure by penalizing angular differences.
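Both matching losses can be written compactly with broadcasting. The numpy sketch below computes mean-normalized pairwise distances and triplet cosines for the image and word centers and penalizes their differences with a Huber (smooth L1) penalty; averaging instead of summing over pairs/triplets is our own simplification.

```python
import numpy as np

def smooth_l1(a, b):
    """Elementwise Huber penalty l_delta with threshold 1."""
    d = np.abs(a - b)
    return np.where(d < 1.0, 0.5 * d**2, d - 0.5)

def pairwise_dist(C):
    diff = C[:, None, :] - C[None, :, :]
    return np.sqrt((diff**2).sum(-1) + 1e-12)

def distance_matching_loss(S, T):
    """L_D: match mean-normalized pairwise distances of image centers S
    to those of word centers T."""
    dS, dT = pairwise_dist(S), pairwise_dist(T)
    iu = np.triu_indices(len(S), k=1)            # each unordered pair once
    dS, dT = dS[iu] / dS[iu].mean(), dT[iu] / dT[iu].mean()
    return smooth_l1(dS, dT).mean()

def angle_matching_loss(S, T):
    """L_A: match cosines of the angles formed by center triplets <i, j, k>."""
    def angles(C):
        e = C[:, None, :] - C[None, :, :]                    # e[i, j] = c_i - c_j
        e = e / (np.linalg.norm(e, axis=-1, keepdims=True) + 1e-12)
        return np.einsum('ijd,kjd->ijk', e, e)               # cos of angle at vertex j
    return smooth_l1(angles(S), angles(T)).mean()
```

Because distances are mean-normalized and angles are scale-free, both losses vanish when one set of centers is a rotated, translated, and uniformly scaled copy of the other, i.e. they constrain only the relational structure, as intended.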

2.4. ADVERSARIAL TRAINING WITH SEMANTIC CONSTRAINT

Finally, we combine the proposed semantic constraint techniques with the adversarial training framework to learn robust representations, which we call Semantic Constraint Adversarial Robust Learning (SCARL). Our goal is to maximize equation 4 and maintain the manifold structure under the adversarial setting.

Maximizing adversarial $I(S_{adv}; Y)$: As mentioned in Section 2.2, maximizing $I(S_{adv}; Y)$ can be achieved by minimizing a cross-entropy loss instead. To encourage adversarial robustness, this cross-entropy loss can be augmented with a Kullback-Leibler divergence between the predictions on natural examples $x_{nat}$ and adversarial examples $x_{adv}$, as in (Zhang et al., 2019; Dong et al., 2021). The objective function is:

$\mathcal{L}_{adv} = \mathcal{L}_{ce}\left(F_{\theta}(x_{adv}), y\right) + \beta \cdot \mathrm{KL}\left(P(\cdot \mid x_{adv}) \,\|\, P(\cdot \mid x_{nat})\right)$. (12)

Maximizing adversarial $I(S_{adv}; T \mid Y)$: To maximize the semantic mutual information $I(S_{adv}; T \mid Y)$, according to Lemma 2.1 we minimize $\mathcal{L}_{info}$:

$\mathcal{L}_{info} = \mathbb{E}_{q(S,T \mid C_Y=1)}\left[-\log \frac{e^{g_s(s_{adv})^{\top} g_t(t)/\tau}}{e^{g_s(s_{adv})^{\top} g_t(t)/\tau} + c}\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[-\log\left(1 - \frac{e^{g_s(s_{adv})^{\top} g_t(t)/\tau}}{e^{g_s(s_{adv})^{\top} g_t(t)/\tau} + c}\right)\right]$,

where $s_{adv} = F_{\theta}(x_{adv})$, $t = E_{\theta}(z_y)$, $c$ is the cardinality of the dataset, $\tau$ is a temperature that adjusts the concentration level, and $g_s$ and $g_t$ are nonlinear projection heads.

Restricting $S_{image}$ with $T_{word}$: Based on the geometric matching, we formulate the semantic structure constraint loss and the overall objective as:

$\mathcal{L}_{struc} = \mathcal{L}_D + \mathcal{L}_A$, (14)

$\mathcal{L}_{obj} = \mathcal{L}_{adv} + \lambda_1 \cdot \mathcal{L}_{info} + \lambda_2 \cdot \mathcal{L}_{struc}$, (15)

where $\lambda_1$ and $\lambda_2$ are hyperparameters that control the relative importance of the three losses. The training process not only maximizes the mutual information between the visual representation and the semantic word vector in a consistent manifold space, but also captures the semantic structure information, which enables our model to learn more semantic and robust representations that are insensitive to input perturbations.
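The combined objective can be sketched as follows: a TRADES-style adversarial term per equation 12 (cross-entropy plus a weighted KL between the adversarial and natural predictive distributions, in the direction written above) and the weighted sum of equation 15. This is a numpy sketch on precomputed logits, not the paper's training code; the epsilon inside the logarithms is a numerical-stability assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def scarl_adv_loss(logits_adv, logits_nat, y, beta=6.0, eps=1e-12):
    """L_adv = CE(F(x_adv), y) + beta * KL(P(.|x_adv) || P(.|x_nat))  (eq. 12)."""
    p_adv, p_nat = softmax(logits_adv), softmax(logits_nat)
    ce = -np.mean(np.log(p_adv[np.arange(len(y)), y] + eps))
    kl = np.mean(np.sum(p_adv * (np.log(p_adv + eps) - np.log(p_nat + eps)), axis=1))
    return ce + beta * kl

def scarl_objective(l_adv, l_info, l_struc, lam1=1.0, lam2=1.0):
    """L_obj = L_adv + lambda1 * L_info + lambda2 * L_struc  (eq. 15)."""
    return l_adv + lam1 * l_info + lam2 * l_struc
```

When the adversarial and natural logits coincide, the KL term vanishes and the adversarial loss reduces to plain cross-entropy, which matches the degenerate case of equation 12.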

3.1. EXPERIMENTAL SETUP

Datasets. We compare the proposed method with the baselines on widely-used benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and TinyImageNet (Deng et al., 2009). These datasets readily provide one-hot labels and the corresponding semantic words.

Baselines Setup. We compare the robustness of our proposed SCARL with several classical adversarial training methods, including standard AT (Madry et al., 2018), TRADES (Zhang et al., 2019), triplet loss adversarial training (TLA) (Mao et al., 2019), and adversarial training with contrastive learning (ACL) (Jiang et al., 2020). We test the defenses under different white- and black-box attacks, including FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2018), CW (Carlini & Wagner, 2017), Square (Andriushchenko et al., 2019), and AutoAttack (Croce & Hein, 2020).

Model Details. We adopt a pre-activation ResNet-18 (He et al., 2016b) as the image feature extractor, followed by a nonlinear projection with one additional hidden layer (and ReLU activation). The 64-D output vector is normalized by its L2-norm; this is the representation of the image. For the word embedding model, we use GloVe (Pennington et al., 2014), trained on the Wikipedia (Scheepers, 2017) and Gigaword (Rush et al., 2015) datasets. The word vector is likewise projected into a 64-D vector by a nonlinear projection head and normalized; this is the representation of the semantic words.

Training Details. The initial learning rate is γ = 0.1 with a schedule of [0.1, 0.01, 0.001] for all datasets; the decay epochs are [75, 90] for CIFAR and [50, 55] for TinyImageNet, with 100 training epochs for CIFAR and 60 for TinyImageNet. Following (Rice et al., 2020), we adopt the common threat model with radius 8/255, with the PGD attack taking 10 steps of size 2/255 on all datasets.
We train with the SGD optimizer with momentum 0.9 and weight decay 5 × 10^-4, using a batch size of 128. We perform standard data augmentation, including random crops and random horizontal flips, during training. For the hyperparameters, we set λ1 = 1.0 and λ2 = 1.0 along a Gaussian ramp curve in equation 15, and β = 6 in equation 12. Our implementation is based on PyTorch, and the code to reproduce our results will be made available. To verify the impact of semantic information on model robustness, we train a natural model and a robust model with semantic information (SemInfo) and compare them with counterparts trained without semantic information. Table 2 shows the results. We find that the model using semantic information is more robust against simple attacks such as FGSM, but still not robust against stronger attacks. Further, we combine semantic information with standard adversarial training (Madry et al., 2018); SemInfo improves robustness by 1.25% under AutoAttack, which demonstrates that semantic information is beneficial to model robustness.

3.2. MAIN RESULTS

To further verify the influence of semantic information on model robustness, we calculate the CCA of our SCARL model trained with equation 15. The CCA is 0.84, which means the visual representations learned by our SCARL carry more semantic information than those of standard adversarial training (Madry et al., 2018). We further report the results under different white- and black-box attacks at the best checkpoint, selected based on performance under the PGD-10 attack. The results are shown in Table 1. The proposed SCARL achieves the best robustness against the strongest attack, AutoAttack, on both CIFAR and TinyImageNet, where even a small margin of improvement is significant. We further verify the robustness of our method under an adaptive attack (Akhtar & Mian, 2018) in which the attacker has knowledge of and white-box access to the word embedding model. We use a PGD-based adaptive attack to evaluate AT, TRADES, and our SCARL model with ResNet-18 on CIFAR-10. The results, shown in Table 3, demonstrate that our SCARL can still defend against the adaptive attack. We also test performance with WideResNet (Zagoruyko & Komodakis, 2016); due to space limitations, the results are given in the supplementary material. In addition, we plot the test accuracy over epochs, evaluate adversarial accuracy against PGD attacks under different attack budgets with a fixed number of 10 attack steps, and conduct experiments using PGD attacks with different numbers of iterations under a fixed attack budget of 8/255. The results are shown in Figure 4 (a-c). Our SCARL outperforms standard AT and TRADES at larger budgets; moreover, SCARL remains stable against attacks with many iterations, e.g., a PGD attack with 500 iterations. These results demonstrate the effectiveness of the proposed SCARL.

3.3. ABLATION STUDIES

For the ablation studies, all comparative experiments are performed on CIFAR-10, and all hyperparameters other than the variable under study are kept exactly the same.

Ablation of Different Techniques. We validate the two proposed techniques in SCARL; the results are given in Table 4. From Table 4, we observe significant performance gains from each technique, confirming the merits of our proposed techniques.

Comparison with InfoNCE. The proposed $\mathcal{L}_{info}$ is similar to InfoNCE (Hjelm et al., 2018), an alternative contrastive objective that selects a single positive out of a set of distractors via a softmax function. We compare InfoNCE with our $\mathcal{L}_{info}$ using the same number of negatives. Table 4 shows that our $\mathcal{L}_{info}$ outperforms InfoNCE, confirming its merits.

Effect of Semantic Embeddings. We use different word embeddings to verify that semantic information is beneficial to robustness. We design four embedding schemes: a) Random: a random vector as the semantic word vector; b) NN: word vectors generated by a learnable neural-network embedding layer; c) CLIP: word vectors generated by a trained CLIP model; d) GloVe: word vectors generated by a trained GloVe model. The results are shown in Table 5. We observe that the stronger the semantic information, the higher the robustness.

Effect of Batch Size. Theoretically, both the proposed $\mathcal{L}_{info}$ and InfoNCE can benefit from a large batch size. To evaluate the effect of batch size, we test six values and show the results in Figure 4(d). For standard adversarial training, performance drops dramatically as the mini-batch size grows, indicating that adversarial training is not suited to large batch sizes on CIFAR-10.
Another reason $\mathcal{L}_{info}$ does not benefit from a large batch size is that, for a given dataset, the semantic words of the categories are fixed: even with a large batch, the number of negative samples in $\mathcal{L}_{info}$ is still restricted by the number of semantic words.

4. CONCLUSION

In this paper, we analyzed the relationship between semantic information and model robustness from a distributional and structural perspective, showing that the robustness of image representations is closely related to semantic information. Based on our findings, we proposed Semantic Constraint Adversarial Robust Learning (SCARL), which learns visual representations by capturing the distributional and structural information of semantic words. Experimentally, we demonstrated the effectiveness of the proposed SCARL on multiple benchmark datasets and showed that semantic information can indeed improve model robustness.

A THE PROOF OF LEMMA 2.1

Lemma 2.1. Given 1 congruent pair $\{x, z, y \mid C_Y = 1\}$ and $N$ incongruent pairs $\{x_i, z_i, y_i \mid C_Y = 0\}_{i=1}^{N}$ sampled i.i.d. from the distribution $p_D(x, y, z)$, with $s = F_{\theta}(x)$ and $t = E_{\theta}(z)$, $I(S; T \mid Y)$ is lower bounded by

$-\mathcal{L}_{info} = \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log \frac{e^{g_s(s)^{\top} g_t(t)/\tau}}{e^{g_s(s)^{\top} g_t(t)/\tau} + c}\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - \frac{e^{g_s(s)^{\top} g_t(t)/\tau}}{e^{g_s(s)^{\top} g_t(t)/\tau} + c}\right)\right]$, (16)

where $c$ is the cardinality of the dataset and $\tau$ is a temperature that adjusts the concentration level. $g_s$ and $g_t$ are nonlinear projection heads that transform the representations into the same manifold space, followed by L2 normalization.

Proof. To derive the lower bound, we relate the mutual information to the joint distribution $p(S, T)$ and the product of marginals $p(S)p(T)$: maximizing the mutual information between the visual representation and the word vector amounts to maximizing the KL divergence between these distributions. To this end, we define a distribution $q$ with a latent variable $C_Y$. When $C_Y = 1$, the pair $\langle S, T \rangle$ shares the same one-hot label and is drawn from the joint distribution. Conversely, when $S$ and $T$ have different one-hot labels ($C_Y = 0$), the pair is independent and is drawn from the product of marginals.
As a result, the formulation can be written as:

$q(S, T \mid C_Y = 1) = p(S, T \mid Y), \qquad q(S, T \mid C_Y = 0) = p(S \mid Y)\,p(T \mid Y)$.

Suppose there are 1 congruent pair, drawn from the joint distribution, and $N$ incongruent pairs, drawn from the product of marginals. The priors on the latent variable $C_Y$ are:

$q(C_Y = 1) = \frac{1}{1+N}, \qquad q(C_Y = 0) = \frac{N}{1+N}$.

By Bayes' rule, the posterior for the class $C_Y = 1$ is:

$q(C_Y = 1 \mid S, T) = \frac{q(S, T \mid C_Y = 1)\,q(C_Y = 1)}{q(S, T \mid C_Y = 0)\,q(C_Y = 0) + q(S, T \mid C_Y = 1)\,q(C_Y = 1)} = \frac{p(S, T \mid Y)}{p(S, T \mid Y) + N\,p(S \mid Y)\,p(T \mid Y)}$.

Then, we build a connection with the mutual information:

$\log q(C_Y = 1 \mid S, T) = \log \frac{p(S, T \mid Y)}{p(S, T \mid Y) + N\,p(S \mid Y)\,p(T \mid Y)} = -\log\left(1 + N\,\frac{p(S \mid Y)\,p(T \mid Y)}{p(S, T \mid Y)}\right) \le -\log N + \log \frac{p(S, T \mid Y)}{p(S \mid Y)\,p(T \mid Y)}$.

Taking the expectation on both sides w.r.t. $p(S, T \mid Y)$ and rearranging, we have:

$I(S; T \mid Y) \ge \log N + \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log q(C_Y = 1 \mid S, T)\right]$, (21)

where $I(S; T \mid Y)$ is the conditional mutual information between the distributions of visual representations and word vectors under the same one-hot label. Thus, maximizing $\mathbb{E}_{q(S,T \mid C_Y=1)}[\log q(C_Y = 1 \mid S, T)]$ w.r.t. the parameters of the objective model increases a lower bound on the mutual information. However, it is intractable to optimize this bound directly, since we do not know the true distribution $q(C_Y = 1 \mid S, T)$. Thus, we estimate it by fitting a model $h : \{S, T\} \to [0, 1]$ to samples from the data distribution.
We maximize the log-likelihood of the data under this model (a binary classification problem):

$$\mathcal{L}_{\mathrm{est}}(h) = \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log h(S, T)\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - h(S, T)\right)\right], \tag{22}$$

$$h^\star = \arg\max_h \mathcal{L}_{\mathrm{est}}(h). \tag{23}$$

According to Gibbs' inequality, we have:

$$q(C_Y = 1 \mid S, T) = h^\star = \arg\max_h \mathcal{L}_{\mathrm{est}}(h). \tag{24}$$

The details of equation 24 can be found in Appendix B. Thus, we can rewrite equation 21 in terms of $h^\star$:

$$I(S; T \mid Y) \geq \log N + \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log h^\star(S, T)\right], \tag{25}$$

i.e., the optimal $h^\star$ is an estimator whose expectation lower-bounds the mutual information. Our goal is to learn a model $F_\theta$ that maximizes the mutual information between visual representations and the corresponding semantic word vectors. As a result, we have the following optimization problem:

$$F_\theta = \arg\max_\theta \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log h^\star(S, T)\right]. \tag{26}$$

However, this is still difficult to optimize, since the estimator $h^\star$ depends on the current model $F_\theta$. To handle this problem, we further relax the bound in equation 25 to:

$$I(S; T \mid Y) \geq \log N + \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log h^\star(S, T)\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - h^\star(S, T)\right)\right] = \log N + \mathcal{L}_{\mathrm{est}}(h^\star) = \log N + \max_h \mathcal{L}_{\mathrm{est}}(h) \geq \log N + \mathcal{L}_{\mathrm{est}}(h). \tag{27}$$

Since $N\,\mathbb{E}_{q(S,T \mid C_Y=0)}[\log(1 - h^\star(S, T))]$ is strictly negative, the inequality still holds after adding this term to the right-hand side of equation 25. Then, optimizing equation 27 w.r.t. $F_\theta$ can be reformulated as:

$$F_\theta = \arg\max_\theta \max_h \mathcal{L}_{\mathrm{est}}(h) = \arg\max_\theta \max_h \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log h(S, T)\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - h(S, T)\right)\right]. \tag{28}$$

Finally, equation 28 is our learning objective, which jointly optimizes $F_\theta$ together with $h$. Note that, due to equation 27, $\log N + \mathcal{L}_{\mathrm{est}}(h)$ is a lower bound on the mutual information for any $h$, which means our objective in equation 28 does not depend on $h$ being optimized perfectly. Therefore, we can define $h$ with any family of functions satisfying $h: \{S, T\} \rightarrow [0, 1]$.
In practice, we define $h$ as follows:

$$h(S, T) = \frac{e^{g_s(s)^\top g_t(t)/\tau}}{e^{g_s(s)^\top g_t(t)/\tau} + c}, \tag{29}$$

where $c$ is the cardinality of the dataset and $\tau$ is a temperature that adjusts the concentration level; $g_s$ and $g_t$ are nonlinear projection heads that transform the representations into the same manifold space, followed by L2 normalization. Therefore, $I(S; T \mid Y)$ is lower bounded by

$$-\mathcal{L}_{\mathrm{info}} = \mathbb{E}_{q(S,T \mid C_Y=1)}\left[\log \frac{e^{g_s(s)^\top g_t(t)/\tau}}{e^{g_s(s)^\top g_t(t)/\tau} + c}\right] + N\,\mathbb{E}_{q(S,T \mid C_Y=0)}\left[\log\left(1 - \frac{e^{g_s(s)^\top g_t(t)/\tau}}{e^{g_s(s)^\top g_t(t)/\tau} + c}\right)\right]. \tag{30}$$

B THE PROOF OF EQUATION 24

Proof. In equation 24, the term $q(C = 1 \mid S, T)$ represents the true distribution of data drawn under the same one-hot label, where $C$ is a binary variable indicating whether the label is correct. It is therefore natural to model $q(C \mid S, T)$ as a Bernoulli distribution via $h: (S, T) \rightarrow [0, 1]$. For convenience, we define $h'(S, T, C = 1) = h(S, T)$ and $h'(S, T, C = 0) = 1 - h(S, T)$; the log-likelihood is then:

$$\mathbb{E}_{c \sim q(C \mid S, T)}\left[\log h'(S, T, C = c)\right]. \tag{31}$$

By Gibbs' inequality, the maximum-likelihood fit is $h'(S, T, C = c) = q(C = c \mid S, T)$, which implies $h(S, T) = q(C = 1 \mid S, T)$. Then, we rewrite our objective in equation 22 as follows:

$$\mathbb{E}_{s,t \sim q(S,T)}\,\mathbb{E}_{c \sim q(C \mid S=s, T=t)}\left[\log h'(S = s, T = t, C = c)\right] \tag{32}$$
$$= \mathbb{E}_{c,s,t \sim q(C,S,T)}\left[\log h'(S = s, T = t, C = c)\right] \tag{33}$$
$$= q(C = 1)\,\mathbb{E}_{s,t \sim q(S,T \mid C=1)}\left[\log h(S = s, T = t)\right] + q(C = 0)\,\mathbb{E}_{s,t \sim q(S,T \mid C=0)}\left[\log\left(1 - h(S = s, T = t)\right)\right] \tag{34}$$
$$= \frac{1}{N + 1}\,\mathbb{E}_{s,t \sim q(S,T \mid C=1)}\left[\log h(S = s, T = t)\right] + \frac{N}{N + 1}\,\mathbb{E}_{s,t \sim q(S,T \mid C=0)}\left[\log\left(1 - h(S = s, T = t)\right)\right]. \tag{35}$$

Notice that equation 35 is proportional to equation 22 from Appendix A. For a sufficiently expressive $h$, each term inside the expectation in equation 32 can be maximized, resulting in $h^\star(S = s, T = t) = q(C = 1 \mid S = s, T = t)$ for all $s$ and $t$.
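As a minimal illustration, the critic $h$ and an empirical estimate of $-\mathcal{L}_{\mathrm{info}}$ can be sketched in NumPy (the dimensions, temperature, constant $c$, and random vectors standing in for projected representations are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def h(s_proj, t_proj, c=50000.0, tau=0.5):
    """Critic h(S, T) = exp(g_s(s)' g_t(t) / tau) / (exp(...) + c).

    Inputs are assumed to be already projected and L2-normalized, so the
    inner product lies in [-1, 1] and h lies strictly in (0, 1).
    """
    logits = np.exp(np.sum(s_proj * t_proj, axis=-1) / tau)
    return logits / (logits + c)

# One congruent pair and N incongruent pairs (toy random vectors).
N, dim = 8, 16
s = l2_normalize(rng.normal(size=dim))
t_pos = l2_normalize(s + 0.1 * rng.normal(size=dim))   # same label
t_neg = l2_normalize(rng.normal(size=(N, dim)))        # different labels

# Empirical estimate of -L_info: one positive term plus N negative terms.
neg_L_info = np.log(h(s, t_pos)) + np.sum(np.log(1.0 - h(s, t_neg)))
assert 0.0 < h(s, t_pos) < 1.0
assert neg_L_info < 0.0   # every log term is negative since h is in (0, 1)
```

Maximizing this quantity pulls the congruent pair's similarity up and pushes the incongruent pairs' similarities down, which is exactly the binary-classification objective of equation 22.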

C PERFORMANCE UNDER WIDERESNET

Many works have demonstrated that larger model capacity usually leads to better adversarial robustness (Madry et al., 2018; Gowal et al., 2020; Pang et al., 2021). Therefore, we also employ a large-capacity network, i.e., Wide ResNet (Zagoruyko & Komodakis, 2016). Table 6 reports the best test robustness against AA on CIFAR-10, comparing against several state-of-the-art adversarially trained models from the robustness benchmark (Croce et al., 2020). Our SCARL achieves 54.65%, and 56.42% when combined with AWP, surpassing the previous state-of-the-art models reported by the benchmark, where every small margin of improvement is significant. Note that our experiments did not use additional datasets.

Table 6: Robustness accuracy comparison of the proposed approach and several state-of-the-art models under AA at ℓ∞ norm with ϵ = 8/255 on CIFAR-10. Most of the results are directly copied from the leaderboards (Croce & Hein, 2020). ⋆ indicates the model is reproduced by ourselves.

E RELATED WORK

The problem of adversarial examples was first studied in (Szegedy et al., 2014). Since then, many works have proposed a series of adversarial attack methods (Moosavi-Dezfooli et al., 2016; Papernot et al., 2016a; Carlini & Wagner, 2017; Croce & Hein, 2020), which put severe limitations on the application of deep learning in security-critical scenarios. With the rapid development of attack methods, considerable effort has been devoted to defending against adversarial attacks, such as defensive distillation (Papernot et al., 2016b), manifold projection (Samangouei et al., 2018), pre-processing (Guo et al., 2018; Yang et al., 2019), verification and provable defenses (Raghunathan et al., 2018; Salman et al., 2019), and adversarial training (Madry et al., 2018; Zhang et al., 2019). Among them, adversarial training has been demonstrated to be a practical approach for strengthening the robustness of deep neural networks (Athalye et al., 2018). Adversarial training involves the min-max optimization problem of equation 1: the inner maximization can be solved approximately using the FGSM or PGD attack, and the outer minimization can be achieved by minimizing the cross-entropy loss. Based on that, a number of new adversarial training methods have been developed from different aspects, including designing new adversarial regularizations (Zhang et al., 2019; Mao et al., 2019), robust architecture search (Guo et al., 2020; Hosseini et al., 2021), training strategies (Wong et al., 2020; Pang et al., 2021), and data augmentation (Carmon et al., 2019; Rebuffi et al., 2021). To the best of our knowledge, we are the first to explore the impact of semantic information on adversarial training.
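As an illustration of the inner maximization described above, an ℓ∞ PGD attack can be sketched on a toy binary logistic-regression model with an analytic gradient (the model, step size, and budget are illustrative assumptions, not settings from the paper):

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient ascent on the logistic loss within an eps-ball.

    Toy sketch of the inner maximization of adversarial training for a
    linear model f(x) = w.x + b with label y in {-1, +1}.
    """
    x_adv = x.copy()
    for _ in range(steps):
        margin = y * (x_adv @ w + b)
        # d/dx of log(1 + exp(-margin)) = -sigmoid(-margin) * y * w
        grad = -(1.0 / (1.0 + np.exp(margin))) * y * w
        x_adv = x_adv + alpha * np.sign(grad)       # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project into ℓ∞ ball
    return x_adv

def loss(x, y, w, b):
    return np.log1p(np.exp(-y * (x @ w + b)))

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.0
x, y = rng.normal(size=5), 1.0
x_adv = pgd_attack(x, y, w, b)
assert np.all(np.abs(x_adv - x) <= 0.1 + 1e-9)   # stays inside the eps-ball
assert loss(x_adv, y, w, b) > loss(x, y, w, b)   # attack increases the loss
```

With a single step and `alpha = eps`, the same routine reduces to FGSM; deep models replace the analytic gradient with backpropagation through the network.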



The CIFAR-10 dataset contains 10 categories: airplane (0), car (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8), truck (9). Due to space limitations, we show the visualization of CCA in the supplementary material.



Figure 1: The canonical correlation analysis (CCA) of natural and adversarial images with the semantic words under the naturally and adversarially trained models, respectively. In each plot, we sample 500 image-word pairs to calculate the correlation coefficient. * indicates the model is trained by FGSM, which is less robust than standard adversarial training (d). The larger the CCA, the stronger the correlation between the visual representation and the semantic word vector.

Figure 2: The similarity matrix between features of different categories learned by different models on CIFAR-10; different numbers represent different categories. The similarity is calculated by taking the inner product between the normalized features of different categories. Brighter colors indicate larger similarity.
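A similarity matrix of this kind can be computed in a few lines (a minimal sketch using random stand-in vectors for the per-class features; the feature dimension and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-class mean features of a 10-class model (random values,
# not the paper's learned representations).
feats = rng.normal(size=(10, 64))
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize

# Similarity matrix: inner products between normalized class features.
sim = feats @ feats.T
assert sim.shape == (10, 10)
assert np.allclose(np.diag(sim), 1.0)   # self-similarity of unit vectors
assert np.allclose(sim, sim.T)          # symmetric by construction
```

Because the features are unit-normalized, each entry is a cosine similarity in [-1, 1], which maps directly to the brightness scale of the figure.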

Figure 4: (a) shows the test accuracy curves (under PGD-10). (b) and (c) show the test accuracy under PGD attack with different attack budgets and attack iterations, respectively. (d) shows the test accuracy under different batch sizes. All these experiments were conducted on CIFAR-10.

Figure 5: The canonical correlation analysis (CCA) of natural and adversarial images with the semantic words under the non-robust and robust models, respectively. In each plot, we sample 500 image-word pairs to calculate the correlation coefficient. The larger the CCA, the stronger the correlation between the visual representation and the semantic word vector.

Robustness accuracy comparison of the proposed approach and baseline models under different attack methods under the ℓ∞ norm with ϵ = 8/255 on different datasets. All the models are based on the pre-activation ResNet-18 architecture. We choose the best checkpoint according to the highest robust accuracy on the test set under PGD-10. The best results are boldfaced.

The robustness results in CIFAR-10 under the ℓ ∞ norm with ϵ = 8/255.

Robustness comparison of different adversarially trained models under adaptive attacks.

Impact of different techniques of our proposed method. Base indicates the model trained with equation 12.

The effect of different word embedding. Base indicates the model trained with equation 12.


Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.

We combined the semantic information (SemInfo) with TRADES, with the balance parameter β set to 1 and 6. The results are shown above; we can see that the robustness can be further improved.

