TARGETED ADVERSARIAL SELF-SUPERVISED LEARNING

Abstract

Recently, unsupervised adversarial training (AT) has been extensively studied as a means of attaining robustness with models trained on unlabeled data. To this end, previous studies have applied existing supervised adversarial training techniques to self-supervised learning (SSL) frameworks. However, all of them have resorted to untargeted adversarial learning, since it is unclear how to obtain targeted adversarial examples in the SSL setting, which lacks label information. In this paper, we propose a novel targeted adversarial training method for SSL frameworks, in particular for those built on positive pairs. Specifically, we propose a target selection algorithm for adversarial SSL frameworks; it is designed to select the most confusing sample for each given instance based on similarity and entropy, and to perturb the given instance toward the selected target sample. Our method significantly enhances the robustness of a positive-only SSL model without requiring large batches of images or additional models, unlike existing works aimed at the same goal. Moreover, our method is readily applicable to general SSL frameworks that use only positive pairs. We validate our method on benchmark datasets, on which it obtains superior robust accuracy, outperforming existing unsupervised adversarial training methods.

1. INTRODUCTION

Enhancing the robustness of deep neural networks (DNNs) is a critical challenge for their real-world applications. DNNs are known to be vulnerable to adversarial attacks using imperceptible perturbations (Goodfellow et al., 2015), corrupted images (Hendrycks & Dietterich, 2019), and images with shifted distributions (Koh et al., 2021), all of which cause the attacked DNN models to make incorrect predictions. A vast volume of prior studies has proposed leveraging adversarial training (AT) (Madry et al., 2018); AT explicitly uses adversarial examples generated with specific types of perturbations (e.g., ℓ∞-norm attacks) when training a DNN model. Most of these previous AT studies have considered supervised learning settings (Madry et al., 2018; Zhang et al., 2019; Wu et al., 2020; Wang et al., 2019), in which class label information can be utilized to generate adversarial examples. On the other hand, achieving robustness in a self-supervised learning (SSL) setting has been relatively understudied despite the recent success of SSL in a variety of tasks and domains. SSL frameworks (Dosovitskiy et al., 2015; Zhang et al., 2016; Tian et al., 2020b; Chen et al., 2020; He et al., 2020; Grill et al., 2020; Chen & He, 2021) have been proposed to learn transferable visual representations by solving pretext tasks constructed from the training data (Dosovitskiy et al., 2015; Zhang et al., 2016). A popular SSL approach is contrastive learning (e.g., SimCLR (Chen et al., 2020), MoCo (He et al., 2020)), which learns to maximize the similarity across positive pairs, each of which contains differently augmented samples of the same instance, while minimizing the similarity across different instances. Recently, to establish robustness in these SSL frameworks, RoCL (Kim et al., 2020) and ACL (Jiang et al., 2020) have proposed adversarial SSL methods based on contrastive learning frameworks. They have demonstrated improved robustness without leveraging any labeled data.
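As background, supervised AT (Madry et al., 2018) generates its adversarial examples with projected gradient descent (PGD) under an ℓ∞-norm constraint. The following is a minimal NumPy sketch of an untargeted ℓ∞ PGD attack on a toy logistic model; the linear classifier `w` and all hyperparameter values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """Untargeted l_inf PGD: take sign-gradient ascent steps on the loss,
    projecting back into the eps-ball around x after each step."""
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(steps):
        g = grad_fn(x_adv, y)                     # gradient of the loss w.r.t. the input
        x_adv = x_adv + alpha * np.sign(g)        # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # l_inf projection
    return x_adv

# Toy setting: logistic loss of a fixed linear classifier w (hypothetical).
w = np.array([1.0, -2.0])

def loss(x, y):
    return np.log1p(np.exp(-y * (w @ x)))

def grad_fn(x, y):
    # d/dx log(1 + exp(-y * w.x)) = -y * w / (1 + exp(y * w.x))
    return -y * w / (1.0 + np.exp(y * (w @ x)))

x, y = np.array([0.5, 0.5]), 1.0
x_adv = pgd_linf(x, y, grad_fn, eps=0.1, alpha=0.02, steps=20)
```

In supervised AT the label `y` pins down the attack direction; the difficulty this paper addresses is that no such label exists in the SSL setting.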
However, both of these adversarial SSL frameworks are inefficient, as they require a large batch size to attain good performance on either clean or adversarial samples. Recent SSL frameworks (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021) mostly resort to maximizing the consistency across two differently augmented samples of the same instance, either using an additional momentum encoder (Grill et al., 2020) or without any negative pairs or additional networks (Chen & He, 2021; Zbontar et al., 2021). Such non-contrastive SSL frameworks using only positive pairs are shown to obtain representations with superior generalization performance compared to their contrastive counterparts, in a more efficient manner. However, leveraging untargeted adversarial attacks in these SSL frameworks results in suboptimal performance. BYORL (Gowal et al., 2021a), an adversarial SSL framework using only positive pairs, obtains much lower robust accuracies than those of adversarial contrastive SSL methods on the benchmark datasets (Table 3). Then, what is the cause of such suboptimal robustness in a non-contrastive adversarial SSL framework? We observe that this limited robustness mainly comes from the suboptimality of untargeted attacks; adversarial examples generated by the deployed untargeted attacks are ineffective in improving robustness in non-contrastive adversarial SSL frameworks. As shown in Figure 1, the attack in the inner loop of the adversarial training loss, which maximizes the distance between two differently augmented samples of the same instance, perturbs a given example toward an arbitrary position in the latent space. Thus, the generated adversarial samples have little impact on the final decision boundaries. In contrast, in contrastive SSL frameworks, the samples are perturbed toward negative samples to maximize the instance classification loss, and most of these negative samples belong to different classes.
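The untargeted attack described above can be sketched as follows: the inner loop perturbs the input to minimize its embedding's similarity to the positive view, with nothing constraining where in the latent space the embedding ends up. The one-layer `tanh` encoder, the numerical gradient (a stand-in for autograd), and all values are illustrative assumptions for this sketch only.

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Central-difference gradient (stand-in for autograd in this sketch)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def untargeted_ssl_attack(x, x_pos, encode, eps=0.2, alpha=0.05, steps=10):
    """Perturb x to minimize its embedding's similarity to the positive view's
    embedding. No other instance constrains the direction, so the embedding
    may drift to an arbitrary region of the latent space."""
    z_pos = encode(x_pos)
    x_adv = x.copy()
    for _ in range(steps):
        g = num_grad(lambda v: -cosine(encode(v), z_pos), x_adv)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

# Hypothetical one-layer encoder and a pair of "augmented views".
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
encode = lambda v: np.tanh(W @ v)
x = rng.normal(size=3)
x_pos = x + 0.01 * rng.normal(size=3)  # stand-in for a second augmentation
x_adv = untargeted_ssl_attack(x, x_pos, encode)
```

The attack succeeds at lowering the pair's similarity, but the resulting direction is essentially arbitrary, which is exactly the weakness this paper targets.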
Thus, the ineffectiveness of untargeted attacks in non-contrastive SSL frameworks mostly comes from their disregard of other instances. To tackle this issue, we propose Targeted Attack for RObust Self-Supervised learning (TAROSS). TAROSS is designed to enhance the robustness of a non-contrastive SSL framework with only positive pairs, such as BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021), by conducting targeted attacks, which perturb the given instance toward a target. However, this raises the question of toward which direction the targeted attack should be performed, which is unclear in unsupervised adversarial learning without class labels. To address this point, we consider attacking the instance toward another instance, and further perform an empirical study of which target instances, as opposed to randomly selected ones, help enhance robustness under a targeted attack. Based on our observations, we propose a simple yet effective target selection algorithm based on the similarity and entropy between instances. Our main contributions can be summarized as follows:
• We demonstrate that achieving robustness comparable to that of contrastive-based SSL is difficult in positive-only self-supervised learning (SSL), due to the ineffective adversarial examples generated by untargeted attacks.
• We perform an empirical study of different targeted attacks for non-contrastive adversarial SSL frameworks using only positive pairs. Based on this observation, we propose a novel targeted adversarial attack scheme that perturbs each instance toward the most confusing instance to it, selected based on similarity and entropy.
• We experimentally confirm that the proposed targeted adversarial SSL framework obtains significantly higher robustness, outperforming state-of-the-art contrastive- and positive-only adversarial SSL methods.
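One plausible instantiation of the similarity-and-entropy-based target selection described above can be sketched as follows. Each candidate in the batch is scored by its cosine similarity to the query embedding and by the entropy of its own similarity distribution over the batch; the additive combination and the weight `w` are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def select_target(z, Z, w=1.0):
    """Score each candidate embedding in Z by (i) its cosine similarity to the
    query z and (ii) the entropy of its similarity distribution over the batch
    (high entropy = the candidate itself is ambiguous). Return the index of
    the highest-scoring, i.e., most confusing, candidate."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    zn = z / np.linalg.norm(z)
    sim = Zn @ zn                                    # similarity to the query
    P = np.apply_along_axis(softmax, 1, Zn @ Zn.T)   # row i: candidate i's similarity distribution
    ent = -(P * np.log(P + 1e-12)).sum(axis=1)       # entropy of each distribution
    return int(np.argmax(sim + w * ent))             # additive weighting is an assumption

# Tiny example: the candidate nearest the query scores highest.
z = np.array([1.0, 0.0])
Z = np.array([[0.9, 0.1],    # close to the query
              [-1.0, 0.0],   # opposite direction
              [0.0, 1.0]])   # orthogonal
idx = select_target(z, Z)
```

The selected index would then serve as the target of a targeted attack that maximizes the perturbed instance's similarity to it, rather than merely pushing the instance away from its positive view.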



Figure 1: Motivation. In supervised adversarial learning (a), the perturbation is generated to maximize the cross-entropy loss, which pushes adversarial examples toward the decision boundaries of other classes. In adversarial contrastive SSL (b), the perturbation is generated to minimize the similarity (red line) between positive pairs while maximizing the similarity (blue lines) between negative pairs. The adversarial examples may thus be pushed into the space of other classes, as the negative samples mostly come from other classes. However, in positive-only adversarial SSL (c), minimizing the similarity (red) between positive pairs imposes weaker constraints on generating effective adversarial examples than supervised AT or contrastive SSL. To overcome this limitation, we suggest a selective targeted attack that maximizes the similarity (blue) to the most confusing target instance (yellow square in (c)).

